# Filtering the datasets

## Datasets

We are working with sentences for the following **4 languages**:
1. English, **EN**
2. German, **DE**
3. Slovenian, **SL**
4. Czech, **CZ**

The **ParaCrawl** datasets we have extracted contain the following translations of sentences:
1. CZ-EN
2. DE-EN
3. SL-EN

The **MultiParaCrawl** datasets we have extracted contain the following translations of sentences:
1. CZ-DE
2. DE-SL
3. SL-CZ

## Q&A

We are trying to answer the following questions:
> How **similar** are the sentences from the datasets we have extracted? - Merging the three ParaCrawl datasets on their English sentences yields a total of **239,508** rows that have a translation in all four languages.

> Can we merge the ParaCrawl datasets into one big **multilingual** dataset? - **Yes**, we can, since the number of sentences shared by the three ParaCrawl datasets is large enough. The multilingual dataset will have 239,508 rows and 5 columns (ID, English, German, Czech, Slovenian).

> Will we need to use the **MultiParaCrawl** datasets as well in order to create a multilingual dataset? - **No**, we will not need the three MultiParaCrawl datasets, since the multilingual dataset created by merging the three ParaCrawl datasets is large enough.

> What should be the **minimum length** of the sentences? - It depends on the complexity of the language and the type of text we are dealing with, but it is good to keep sentences that are at least **5-10** words long.

> What should be the **maximum length** of the sentences? - It depends on the complexity of the language and the type of text we are dealing with, but it is good to keep sentences that are no longer than **20-25** words.

> What will be the **size** of the filtered datasets? - It depends on the computational resources and time available for training the model, but a dataset of at least **10,000 - 20,000** sentences is recommended to train a good paraphrasing model. After filtering, the remaining size of our datasets is **...**.



## Implementation

### 1. Similarity of datasets

First, we will look for the sentences that the **ParaCrawl** datasets have in common.

#### 1.1. Load the datasets and export them as csv files

In [7]:
# Configurations
import pandas as pd
from datasets import load_dataset # conda install -c huggingface huggingface_hub -c conda-forge datasets

In [9]:
# Load the CZ-EN dataset
dataset_cz_en = load_dataset("para_crawl", "encs")

Downloading and preparing dataset para_crawl/encs to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/encs/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/2981949 [00:00<?, ? examples/s]

Dataset para_crawl downloaded and prepared to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/encs/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec. Subsequent calls will reuse this data.



In [61]:
# Convert to a DataFrame
df_cz_en = pd.DataFrame(dataset_cz_en['train']['translation'])
df_cz_en.rename(columns={'cs': 'Czech', 'en': 'English'}, inplace=True)
df_cz_en.head()

Unnamed: 0,Czech,English
0,"Wy , se také setkal s “ Příslušný stavební zko...","A wy , also met with “ by professional test ” ..."
1,"Wy, se také setkal s “Příslušný stavební zkouš...","A wy, also met with “by professional test” ima..."
2,"Wy, se také setkal s “Příslušný stavební zkouš...","A wy, also met with “by professional test” ima..."
3,"Z francouzského PIM , v 2013, to je jen 9 ostr...","Among the French PIM, in 2013, it is only 9 is..."
4,"Z francouzského PIM, v 2013, to je jen 9 ostro...","Among the French PIM, in 2013, it is only 9 is..."


In [62]:
# Load the DE-EN dataset
dataset_de_en = load_dataset("para_crawl", "ende")

Downloading and preparing dataset para_crawl/ende to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/ende/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.31G [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/16264448 [00:00<?, ? examples/s]

Dataset para_crawl downloaded and prepared to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/ende/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec. Subsequent calls will reuse this data.



In [63]:
# Convert to a DataFrame
df_de_en = pd.DataFrame(dataset_de_en['train']['translation'])
df_de_en.rename(columns={'de': 'German', 'en': 'English'}, inplace=True)
df_de_en.head()

Unnamed: 0,German,English
0,"La coutume veut qu’avec la nouvelle année, jed...","La coutume veut qu’avec la nouvelle année, eac..."
1,"»Das Ziel war, Schriftsteller zu werden und zu...","""The goal was , To become a writer and to be. ..."
2,"»Das Ziel war, Schriftsteller zu werden und zu...","""The goal was, To become a writer and to be. ""..."
3,"‘Ich stecke in allen Figuren’, sagt Chirbes. ‘...","'I'm in all the figures', says Chirbes. 'A lig..."
4,"‚Ich stecke in allen Figuren‘, sagt Chirbes. ‚...","'I'm in all the figures', says Chirbes. 'A lig..."


In [64]:
# Load the SL-EN dataset
dataset_sl_en = load_dataset("para_crawl", "ensl")

Downloading and preparing dataset para_crawl/ensl to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/ensl/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/65.0M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/660161 [00:00<?, ? examples/s]

Dataset para_crawl downloaded and prepared to C:/Users/Acer/.cache/huggingface/datasets/para_crawl/ensl/1.0.0/2e46aec87ddf3adceaf5bcd3c4bbe167c1cbb74aea5758e7821a99d63149c5ec. Subsequent calls will reuse this data.



In [65]:
# Convert to a DataFrame
df_sl_en = pd.DataFrame(dataset_sl_en['train']['translation'])
df_sl_en.rename(columns={'sl': 'Slovenian', 'en': 'English'}, inplace=True)
df_sl_en.head()

Unnamed: 0,English,Slovenian
0,"1. First, press the START (or sign in the left...",1. Najprej pritisnite tipko START (ali znak v ...
1,An anatomy of the right terror in Germany and ...,Anatomija pravi teror v Nemčiji in nesposobni ...
2,An anatomy of the right terror in Germany and ...,Anatomija pravi teror v Nemčiji in nesposobni ...
3,An anatomy of the right terror in Germany and ...,Anatomija pravi teror v Nemčiji in nesposobni ...
4,An anatomy of the right terror in Germany and ...,Anatomija pravi teror v Nemčiji in nesposobni ...


In [15]:
# Delete duplicates
print(f"Old lengths: CZ-EN({len(df_cz_en)}), DE-EN({len(df_de_en)}), SL-EN({len(df_sl_en)})")
df_cz_en.drop_duplicates(inplace=True)
df_de_en.drop_duplicates(inplace=True)
df_sl_en.drop_duplicates(inplace=True)
print(f"New lengths: CZ-EN({len(df_cz_en)}), DE-EN({len(df_de_en)}), SL-EN({len(df_sl_en)})")

Old lengths: CZ-EN(2981937), DE-EN(16264448), SL-EN(660161)
New lengths: CZ-EN(2981937), DE-EN(16264358), SL-EN(660145)


In [16]:
# Export the CZ-EN dataset as a csv file
df_cz_en.to_csv('data/cz_en.csv', index=False)

In [17]:
# Export the DE-EN dataset as a csv file
df_de_en.to_csv('data/de_en.csv', index=False)

In [18]:
# Export the SL-EN dataset as a csv file
df_sl_en.to_csv('data/sl_en.csv', index=False)
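
The convert and deduplicate steps above are the same for all three language pairs, so they could be folded into one helper. The following is only a minimal sketch, not the notebook's own code: the name `translations_to_frame` and the toy records are invented for illustration, and the records stand in for what `dataset['train']['translation']` is assumed to return (a list of dicts keyed by language code).

```python
import pandas as pd

def translations_to_frame(records, column_names):
    """Turn a list of translation dicts into a deduplicated DataFrame.

    `records` is assumed to look like dataset['train']['translation'],
    i.e. a list of dicts keyed by language code.
    """
    df = pd.DataFrame(records).rename(columns=column_names)
    # Drop rows that are identical in every column, then renumber the rows
    # (the index reset is an addition of this sketch)
    return df.drop_duplicates().reset_index(drop=True)

# Toy records, invented for this example
toy_records = [
    {"cs": "Dobrý den.", "en": "Good day."},
    {"cs": "Dobrý den.", "en": "Good day."},
    {"cs": "Děkuji.", "en": "Thank you."},
]
df_toy = translations_to_frame(toy_records, {"cs": "Czech", "en": "English"})
print(len(df_toy))  # 2, the repeated row is dropped
```

Returning a new DataFrame instead of using `inplace=True` keeps the helper free of side effects on its input.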

#### 1.2. Find the similar sentences 

In [19]:
# Configurations
import pandas as pd

In [20]:
# Load the data
df_cz_en = pd.read_csv("data/cz_en.csv")
df_de_en = pd.read_csv("data/de_en.csv")
df_sl_en = pd.read_csv("data/sl_en.csv")

First, we will find the sentences that the CZ-EN and SL-EN datasets have in common, i.e. the rows with an identical English sentence.

In [50]:
# Merge the CZ-EN and SL-EN datasets based on the value of the English column
df_cz_sl_en = df_sl_en.merge(df_cz_en, on="English", how="inner")
df_cz_sl_en

Unnamed: 0,English,Slovenian,Czech
0,Animation allows the skyderen gradually change...,Animacija omogoča Drsnik postopoma spreminja v...,Animace umožňuje Posuvník se postupně mění hod...
1,! Na for public exposure.,! Na javno izpostavljenost.,! Na veřejné vystavení.
2,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
3,""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
4,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."
...,...,...,...
222053,● promote diversity and increased integration ...,● spodbujanje raznolikosti in večje področje i...,● podporovat rozmanitost a větší integrace plo...
222054,● promote greater understanding between the di...,● spodbujanje boljšega razumevanja med okrožni...,● podporovat větší porozumění mezi okresních s...
222055,● strive for a long-term and stable developmen...,● si prizadevajo za dolgoročno in stabilno raz...,● usilovat o dlouhodobý a stabilní rozvoj v př...
222056,"♥ Adjusts to your playing style, perfect for B...","♥ Nastavi na vaše igranje slog, kot nalašč za ...","♥ Přizpůsobí se vašemu stylu hry, ideální pro ..."


Now that we have the sentences shared by the CZ-EN and SL-EN datasets (222,058 rows), we can merge this new SL-CZ-EN dataset with the DE-EN dataset in the same way to create the multilingual dataset.
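
Because both merges join on the same `English` column, the two steps can also be written as a single fold over a list of DataFrames. This is a sketch on invented toy data; `merge_on_english` is a hypothetical helper name, not something the notebook defines.

```python
from functools import reduce

import pandas as pd

def merge_on_english(frames):
    """Inner-join all frames on the shared 'English' column, left to right."""
    return reduce(
        lambda left, right: left.merge(right, on="English", how="inner"),
        frames,
    )

# Toy data, invented for this example
toy_sl = pd.DataFrame({"English": ["Hello.", "Thanks."], "Slovenian": ["Živjo.", "Hvala."]})
toy_cz = pd.DataFrame({"English": ["Hello.", "Goodbye."], "Czech": ["Ahoj.", "Sbohem."]})
toy_de = pd.DataFrame({"English": ["Hello.", "Thanks."], "German": ["Hallo.", "Danke."]})

toy_multilingual = merge_on_english([toy_sl, toy_cz, toy_de])
print(toy_multilingual)  # only "Hello." is present in all three frames
```

The column order of the result follows the order of the frames in the list, so it may differ from the order shown in the notebook output.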

In [52]:
# Merge the SL-CZ-EN and DE-EN datasets based on the value of the English column
df_multilingual = df_de_en.merge(df_cz_sl_en, on="English", how="inner")
df_multilingual

Unnamed: 0,German,English,Slovenian,Czech
0,Sie können das Tool von dieser Anleitung einsc...,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl...",Zde si můžete stáhnout nástroj z této příručky...
1,! Na für die Exposition der Bevölkerung.,! Na for public exposure.,! Na javno izpostavljenost.,! Na veřejné vystavení.
2,Mit Star Trek-Medaillen lassen sich zukünftige...,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
3,""" Ägyptischen Mumien wurden gemacht """" gebaut ...",""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
4,„ Grundlagen“: grundlegende Bausteine zur Förd...,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."
...,...,...,...,...
239503,"► Was enthält der Speicher ""Andere"" tatsächlich?","► What's actually included in ""other"" storage?",► Kaj pravzaprav vsebuje kategorija »drugo«?,► Co vlastně spadá do úložiště s názvem „další“?
239504,● Es gibt einfach zu bedienen und umweltfreund...,● Are user and environmentally friendly and ea...,● Enostavno za čiščenje in storitev,snadné použití a šetrné k životnímu prostředí
239505,● Es gibt einfach zu bedienen und umweltfreund...,● Are user and environmentally friendly and ea...,● Enostavno za čiščenje in storitev,● K dispozici jsou snadné použití a šetrné k ž...
239506,"♥ Passt sich Ihrem Spielstil, ideal für Anfäng...","♥ Adjusts to your playing style, perfect for B...","♥ Nastavi na vaše igranje slog, kot nalašč za ...","♥ Přizpůsobí se vašemu stylu hry, ideální pro ..."


We see that we have a total of 239,508 rows in which a sentence is translated into all 4 languages. This is large enough to be used as one big multilingual dataset.
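
One thing worth keeping in mind when reading this number: an inner merge pairs every matching row on the left with every matching row on the right, so an English sentence that occurs more than once in either input produces several output rows. This would explain how the row count can grow from one merge to the next, and it is one reason the duplicate filtering in section 2 matters. A small sketch with invented data:

```python
import pandas as pd

# "Hello." appears twice on the left and twice on the right
left = pd.DataFrame({
    "English": ["Hello.", "Hello.", "Thanks."],
    "German": ["Hallo.", "Guten Tag.", "Danke."],
})
right = pd.DataFrame({
    "English": ["Hello.", "Hello.", "Thanks."],
    "Czech": ["Ahoj.", "Dobrý den.", "Díky."],
})

merged = left.merge(right, on="English", how="inner")
print(len(merged))                  # 5: 2 x 2 rows for "Hello." plus 1 for "Thanks."
print(merged["English"].nunique())  # 2 distinct English sentences
```

If unique keys are expected, pandas' `merge` also accepts a `validate` argument (for example `validate="one_to_one"`) that raises an error when the key column contains duplicates.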

In [53]:
# Export the multilingual dataset as a csv file
df_multilingual.to_csv('data/multilingual.csv', index=False)

In [57]:
# Generate new DE-EN dataset from the multilingual dataset
df_de_en_new = df_multilingual[['English', 'German']]
print(f"The size of the DE-EN dataset was reduced from {len(df_de_en)} to {len(df_de_en_new)}")
df_de_en_new.head()

The size of the DE-EN dataset was reduced from 16264358 to 239508


Unnamed: 0,English,German
0,You can download the tool from this guide incl...,Sie können das Tool von dieser Anleitung einsc...
1,! Na for public exposure.,! Na für die Exposition der Bevölkerung.
2,!{Star Trek} medals are used to unlock future ...,Mit Star Trek-Medaillen lassen sich zukünftige...
3,""" Egyptian mummies were made as """" built like ...",""" Ägyptischen Mumien wurden gemacht """" gebaut ..."
4,""" Enablers"" , i.e. the basic building blocks w...",„ Grundlagen“: grundlegende Bausteine zur Förd...


In [58]:
# Generate new CZ-EN datasets from the multilingual dataset
df_cz_en_new = df_multilingual[['English', 'Czech']]
print(f"The size of the CZ-EN dataset was reduced from {len(df_cz_en)} to {len(df_cz_en_new)}")
df_cz_en_new.head()

The size of the CZ-EN dataset was reduced from 2981937 to 239508


Unnamed: 0,English,Czech
0,You can download the tool from this guide incl...,Zde si můžete stáhnout nástroj z této příručky...
1,! Na for public exposure.,! Na veřejné vystavení.
2,!{Star Trek} medals are used to unlock future ...,Medaile Star Trek se používají k otevření dalš...
3,""" Egyptian mummies were made as """" built like ...",""" Egyptských mumií byly jako """" postavený jako..."
4,""" Enablers"" , i.e. the basic building blocks w...","„ Předpoklady “, tj. základní stavební kameny,..."


In [59]:
# Generate new SL-EN datasets from the multilingual dataset
df_sl_en_new = df_multilingual[['English', 'Slovenian']]
print(f"The size of the SL-EN dataset was reduced from {len(df_sl_en)} to {len(df_sl_en_new)}")
df_sl_en_new.head()

The size of the SL-EN dataset was reduced from 660145 to 239508


Unnamed: 0,English,Slovenian
0,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl..."
1,! Na for public exposure.,! Na javno izpostavljenost.
2,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...
3,""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg..."
4,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ..."


In [60]:
# Export the new CZ-EN dataset as a csv file
df_cz_en_new.to_csv('data/cz_en_resized.csv', index=False)

In [61]:
# Export the new SL-EN dataset as a csv file
df_sl_en_new.to_csv('data/sl_en_resized.csv', index=False)

In [62]:
# Export the new DE-EN dataset as a csv file
df_de_en_new.to_csv('data/de_en_resized.csv', index=False)

### 2. Filtering of the data

Now that we have the multilingual dataset and the resized bilingual (English plus one other language) datasets, we can filter the data and prepare it for preprocessing in two steps:
* **Filtering based on duplicates** - filter out sentences that are duplicated.
* **Filtering based on length** - filter out sentences that may be too short or too long for paraphrasing.

#### 2.1. Filtering the multilingual dataset

First we will start by filtering the sentences from the multilingual dataset.

In [94]:
# Configurations
import pandas as pd
from langdetect import detect # conda install -c conda-forge langdetect
from alive_progress import alive_bar # conda install -c conda-forge alive-progress
from nltk.tokenize import RegexpTokenizer

In [74]:
# Load the multilingual dataset
df_multilingual = pd.read_csv("data/multilingual.csv")
df_multilingual.head()

Unnamed: 0,German,English,Slovenian,Czech
0,Sie können das Tool von dieser Anleitung einsc...,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl...",Zde si můžete stáhnout nástroj z této příručky...
1,! Na für die Exposition der Bevölkerung.,! Na for public exposure.,! Na javno izpostavljenost.,! Na veřejné vystavení.
2,Mit Star Trek-Medaillen lassen sich zukünftige...,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
3,""" Ägyptischen Mumien wurden gemacht """" gebaut ...",""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
4,„ Grundlagen“: grundlegende Bausteine zur Förd...,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."


Let's start by filtering out the **duplicates** from the multilingual dataset for each language, since we will later also create bilingual datasets from the multilingual one.

In [75]:
# Filter out duplicates from the multilingual dataset
print(f"Old size: {len(df_multilingual)}")
df_multilingual.drop_duplicates(subset=['English'], inplace=True)
print(f"English duplicates removed: {len(df_multilingual)}")
df_multilingual.drop_duplicates(subset=['Slovenian'], inplace=True)
print(f"Slovenian duplicates removed: {len(df_multilingual)}")
df_multilingual.drop_duplicates(subset=['Czech'], inplace=True)
print(f"Czech duplicates removed: {len(df_multilingual)}")
df_multilingual.drop_duplicates(subset=['German'], inplace=True)
print(f"German duplicates removed (New size): {len(df_multilingual)}")

Old size: 239508
English duplicates removed: 103115
Slovenian duplicates removed: 101669
Czech duplicates removed: 101143
German duplicates removed (New size): 100744
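
The four `drop_duplicates` calls above can also be expressed as a loop over the columns. This is a sketch on invented data; `drop_duplicates_per_column` is a hypothetical helper, and unlike the cell above it returns a new DataFrame rather than modifying the input in place.

```python
import pandas as pd

def drop_duplicates_per_column(df, columns):
    """Drop rows whose value repeats in any of the given columns, one column at a time."""
    sizes = {}
    for column in columns:
        # keep='first' (the pandas default) keeps the first occurrence of each value
        df = df.drop_duplicates(subset=[column])
        sizes[column] = len(df)
    return df, sizes

# Toy data, invented for this example
toy = pd.DataFrame({
    "English": ["Hello.", "Hello.", "Hi.", "Thanks."],
    "German": ["Hallo.", "Guten Tag.", "Hallo.", "Danke."],
})
toy_deduplicated, sizes = drop_duplicates_per_column(toy, ["English", "German"])
print(sizes)  # {'English': 3, 'German': 2}
```

Since each step keeps only the first occurrence, the order of the columns can influence which rows survive.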


In [76]:
df_multilingual

Unnamed: 0,German,English,Slovenian,Czech
0,Sie können das Tool von dieser Anleitung einsc...,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl...",Zde si můžete stáhnout nástroj z této příručky...
1,! Na für die Exposition der Bevölkerung.,! Na for public exposure.,! Na javno izpostavljenost.,! Na veřejné vystavení.
2,Mit Star Trek-Medaillen lassen sich zukünftige...,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
3,""" Ägyptischen Mumien wurden gemacht """" gebaut ...",""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
4,„ Grundlagen“: grundlegende Bausteine zur Förd...,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."
...,...,...,...,...
239502,"► Ich habe alle oben genannten Daten, die auf ...",► I've deleted all the stuff you named above t...,"► Iz telefona sem izbrisal vse vsebine, ki ste...",► Odstranil(a) jsem všechny výše uvedené polož...
239503,"► Was enthält der Speicher ""Andere"" tatsächlich?","► What's actually included in ""other"" storage?",► Kaj pravzaprav vsebuje kategorija »drugo«?,► Co vlastně spadá do úložiště s názvem „další“?
239504,● Es gibt einfach zu bedienen und umweltfreund...,● Are user and environmentally friendly and ea...,● Enostavno za čiščenje in storitev,snadné použití a šetrné k životnímu prostředí
239506,"♥ Passt sich Ihrem Spielstil, ideal für Anfäng...","♥ Adjusts to your playing style, perfect for B...","♥ Nastavi na vaše igranje slog, kot nalašč za ...","♥ Přizpůsobí se vašemu stylu hry, ideální pro ..."


Next, we will filter out sentences that are **too short** or **too long**, keeping only sentences that are between 5 and 25 words long.
The length is measured on the English sentence only!
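
To make the word-count criterion concrete: a `\w+` pattern matches runs of letters, digits and underscores, so punctuation and symbols are not counted as words. The sketch below uses Python's built-in `re` module instead of NLTK, which should behave the same for this pattern; `count_words` and `has_valid_length` are hypothetical helper names.

```python
import re

WORD_PATTERN = re.compile(r"\w+")

def count_words(sentence):
    """Count runs of word characters; punctuation and symbols are not counted."""
    return len(WORD_PATTERN.findall(sentence))

def has_valid_length(sentence, min_words=5, max_words=25):
    """True if the sentence has between min_words and max_words words (inclusive)."""
    return min_words <= count_words(sentence) <= max_words

print(count_words("! Na for public exposure."))                        # 4
print(has_valid_length("! Na for public exposure."))                   # False, too short
print(has_valid_length("You can download the tool from this guide."))  # True, 8 words
```

Note that with this pattern a contraction such as "What's" counts as two words, because the apostrophe splits it.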

In [77]:
# Create a tokenizer that counts only words (punctuation is ignored)
tokenizer = RegexpTokenizer(r'\w+')

# Count the words of each English sentence once
word_counts = df_multilingual["English"].apply(lambda sentence_en: len(tokenizer.tokenize(sentence_en)))

# Keep only the rows whose English sentence is between 5 and 25 words long (inclusive)
df_multilingual = df_multilingual[word_counts.between(5, 25)]
df_multilingual

Unnamed: 0,German,English,Slovenian,Czech
0,Sie können das Tool von dieser Anleitung einsc...,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl...",Zde si můžete stáhnout nástroj z této příručky...
2,Mit Star Trek-Medaillen lassen sich zukünftige...,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
3,""" Ägyptischen Mumien wurden gemacht """" gebaut ...",""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
4,„ Grundlagen“: grundlegende Bausteine zur Förd...,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."
7,"""Ich bin sehr dankbar für diese USB Security-S...",""" I am very thankful for this USB Security sof...","""Zelo sem hvaležen za to USB varnostne program...","""Jsem velmi vděčný za tuto USB bezpečnostní so..."
...,...,...,...,...
239502,"► Ich habe alle oben genannten Daten, die auf ...",► I've deleted all the stuff you named above t...,"► Iz telefona sem izbrisal vse vsebine, ki ste...",► Odstranil(a) jsem všechny výše uvedené polož...
239503,"► Was enthält der Speicher ""Andere"" tatsächlich?","► What's actually included in ""other"" storage?",► Kaj pravzaprav vsebuje kategorija »drugo«?,► Co vlastně spadá do úložiště s názvem „další“?
239504,● Es gibt einfach zu bedienen und umweltfreund...,● Are user and environmentally friendly and ea...,● Enostavno za čiščenje in storitev,snadné použití a šetrné k životnímu prostředí
239506,"♥ Passt sich Ihrem Spielstil, ideal für Anfäng...","♥ Adjusts to your playing style, perfect for B...","♥ Nastavi na vaše igranje slog, kot nalašč za ...","♥ Přizpůsobí se vašemu stylu hry, ideální pro ..."


Now that we have filtered the whole multilingual dataset, we are left with **76,879 rows**. Let's export the data.

In [108]:
# Reset the index
df_multilingual.reset_index(drop=True, inplace=True)
df_multilingual

Unnamed: 0,German,English,Slovenian,Czech
0,Sie können das Tool von dieser Anleitung einsc...,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl...",Zde si můžete stáhnout nástroj z této příručky...
1,Mit Star Trek-Medaillen lassen sich zukünftige...,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...,Medaile Star Trek se používají k otevření dalš...
2,""" Ägyptischen Mumien wurden gemacht """" gebaut ...",""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg...",""" Egyptských mumií byly jako """" postavený jako..."
3,„ Grundlagen“: grundlegende Bausteine zur Förd...,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ...","„ Předpoklady “, tj. základní stavební kameny,..."
4,"""Ich bin sehr dankbar für diese USB Security-S...",""" I am very thankful for this USB Security sof...","""Zelo sem hvaležen za to USB varnostne program...","""Jsem velmi vděčný za tuto USB bezpečnostní so..."
...,...,...,...,...
76874,"► Ich habe alle oben genannten Daten, die auf ...",► I've deleted all the stuff you named above t...,"► Iz telefona sem izbrisal vse vsebine, ki ste...",► Odstranil(a) jsem všechny výše uvedené polož...
76875,"► Was enthält der Speicher ""Andere"" tatsächlich?","► What's actually included in ""other"" storage?",► Kaj pravzaprav vsebuje kategorija »drugo«?,► Co vlastně spadá do úložiště s názvem „další“?
76876,● Es gibt einfach zu bedienen und umweltfreund...,● Are user and environmentally friendly and ea...,● Enostavno za čiščenje in storitev,snadné použití a šetrné k životnímu prostředí
76877,"♥ Passt sich Ihrem Spielstil, ideal für Anfäng...","♥ Adjusts to your playing style, perfect for B...","♥ Nastavi na vaše igranje slog, kot nalašč za ...","♥ Přizpůsobí se vašemu stylu hry, ideální pro ..."


In [109]:
# Export the filtered data
df_multilingual.to_csv('data/multilingual_filtered.csv', index=False)

#### 2.2. Generating the filtered bilingual datasets

Now that we have the filtered sentences from the multilingual dataset, we can regenerate the three bilingual datasets from it, so that they also contain only filtered data.
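
The select-and-export steps of this subsection repeat the same pattern for each language, so they can also be written as a loop over the non-English columns. The following is a sketch on invented toy data written to a temporary directory; `split_into_pairs` and the file names are placeholders, not the notebook's own.

```python
import os
import tempfile

import pandas as pd

def split_into_pairs(df_multilingual, languages):
    """Return {language: DataFrame with the 'English' column and that language}."""
    return {language: df_multilingual[["English", language]] for language in languages}

# Toy data, invented for this example
toy_multilingual = pd.DataFrame({
    "German": ["Hallo."],
    "English": ["Hello."],
    "Slovenian": ["Živjo."],
    "Czech": ["Ahoj."],
})
pairs = split_into_pairs(toy_multilingual, ["German", "Czech", "Slovenian"])

# Write each pair to its own CSV file (a temporary directory is used here)
output_dir = tempfile.mkdtemp()
for language, df_pair in pairs.items():
    file_name = f"{language.lower()}_en_filtered.csv"
    df_pair.to_csv(os.path.join(output_dir, file_name), index=False)

print(sorted(os.listdir(output_dir)))
```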

In [110]:
df_multilingual_filtered = pd.read_csv('data/multilingual_filtered.csv')

In [111]:
# Generate new DE-EN dataset from the multilingual dataset
df_de_en_filtered = df_multilingual_filtered[['English', 'German']]
df_de_en_filtered.head()

Unnamed: 0,English,German
0,You can download the tool from this guide incl...,Sie können das Tool von dieser Anleitung einsc...
1,!{Star Trek} medals are used to unlock future ...,Mit Star Trek-Medaillen lassen sich zukünftige...
2,""" Egyptian mummies were made as """" built like ...",""" Ägyptischen Mumien wurden gemacht """" gebaut ..."
3,""" Enablers"" , i.e. the basic building blocks w...",„ Grundlagen“: grundlegende Bausteine zur Förd...
4,""" I am very thankful for this USB Security sof...","""Ich bin sehr dankbar für diese USB Security-S..."


In [112]:
# Generate new CZ-EN datasets from the multilingual dataset
df_cz_en_filtered = df_multilingual_filtered[['English', 'Czech']]
df_cz_en_filtered.head()

Unnamed: 0,English,Czech
0,You can download the tool from this guide incl...,Zde si můžete stáhnout nástroj z této příručky...
1,!{Star Trek} medals are used to unlock future ...,Medaile Star Trek se používají k otevření dalš...
2,""" Egyptian mummies were made as """" built like ...",""" Egyptských mumií byly jako """" postavený jako..."
3,""" Enablers"" , i.e. the basic building blocks w...","„ Předpoklady “, tj. základní stavební kameny,..."
4,""" I am very thankful for this USB Security sof...","""Jsem velmi vděčný za tuto USB bezpečnostní so..."


In [113]:
# Generate new SL-EN datasets from the multilingual dataset
df_sl_en_filtered = df_multilingual_filtered[['English', 'Slovenian']]
df_sl_en_filtered.head()

Unnamed: 0,English,Slovenian
0,You can download the tool from this guide incl...,"Lahko prenesete orodje iz tega priročnika, vkl..."
1,!{Star Trek} medals are used to unlock future ...,Medalje Star Trek se uporabljajo za odklep bod...
2,""" Egyptian mummies were made as """" built like ...",""" Egipčanske mumije so bile narejene kot """" zg..."
3,""" Enablers"" , i.e. the basic building blocks w...","„potencial“ : temeljni gradniki, ki omogočajo ..."
4,""" I am very thankful for this USB Security sof...","""Zelo sem hvaležen za to USB varnostne program..."


In [114]:
# Export the filtered CZ-EN dataset as a csv file
df_cz_en_filtered.to_csv('data/cz_en_filtered.csv', index=False)

In [115]:
# Export the filtered DE-EN dataset as a csv file
df_de_en_filtered.to_csv('data/de_en_filtered.csv', index=False)

In [116]:
# Export the filtered SL-EN dataset as a csv file
df_sl_en_filtered.to_csv('data/sl_en_filtered.csv', index=False)