In [1]:
# Prerequisite: download the raw file of sentence pairs for your language combination from https://tatoeba.org/en/downloads. Go to 'Sentence pairs', then hit 'Download sentence pairs'.
import pandas as pd

In [2]:
# read the Tatoeba .tsv file into pandas. The default file here is a Farsi-to-English one.
raw_sentence_pairs = pd.read_table('sentence-pairs-fa-en.tsv', header=None)

In [21]:
# optional - inspect a sample of the data
raw_sentence_pairs.head()

Id,sourceLanguage,English,sentence_length,sentence_length_mean_centered,quantiles
401635,خیلی ها از تبلیغات فریب خوردند.,Many people were deceived by the advertisement.,47,10.530642,complex
401637,جمعیت زیادی از مردم در جشن حاضر بودند.,A crowd of people were present at a party.,42,5.530642,complex
401640,شمار زیادی از مردم از سرتاسر کشور آمده اند.,Numbers of people came from all over the country.,49,12.530642,most_complex
401648,خیلی ها در صف منتظر بودند.,Many people were waiting in line.,33,-3.469358,medium
401659,خیلی ها به من پیشنهاد رفتن به تعطیلات را داده ...,A good many people have told me to take a holi...,50,13.530642,most_complex


In [4]:
# drop the English sentence identifier column (column 2). The source language identifier column is kept instead and will become the index for each row.
raw_sentence_pairs.drop(columns=2, inplace=True)

In [5]:
# rename the columns and set the index as the ID
raw_sentence_pairs.columns = ['Id', 'sourceLanguage', 'English']
raw_sentence_pairs.set_index('Id', inplace=True)
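
The load, drop, and rename cells above can also be folded into a single read call. The following is a minimal sketch on made-up rows, assuming the file layout is source id, source sentence, English id, English sentence. The column name 'EnglishId' and the sample text are hypothetical, and quoting=csv.QUOTE_NONE is an optional choice so that quote characters inside sentences are read as ordinary text.

```python
import csv
import io

import pandas as pd

# Made-up rows in the assumed layout:
# source id, source sentence, English id, English sentence.
sample_tsv = (
    "1\tsource sentence one\t10\tHello.\n"
    "2\tsource sentence two\t20\tThank you.\n"
    '3\tsource sentence three\t30\tHe said "hi".\n'
)

sentence_pairs = pd.read_table(
    io.StringIO(sample_tsv),  # replace with the path to your .tsv file
    header=None,
    names=["Id", "sourceLanguage", "EnglishId", "English"],
    usecols=["Id", "sourceLanguage", "English"],  # skip the English id column
    index_col="Id",
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
print(sentence_pairs)
```

The result has the same shape as raw_sentence_pairs after In [5]: 'Id' as the index and the two sentence columns.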

In [6]:
# remove duplicate sentence pairs, then keep only the first pair for each English sentence
raw_sentence_pairs.drop_duplicates(subset=['sourceLanguage','English'], inplace=True)
raw_sentence_pairs.drop_duplicates(subset=['English'], inplace=True)

Create a new dataframe to hold the sentence pairs you want to export to Anki
- This new dataframe is called export_sentence_pairs
- The first sentence pairs to go into it come from the keyword filter below


Narrowing the dataset by keywords
- This can help you focus on the areas of your target language where you are weakest, e.g. keywords that trigger a subjunctive clause, specific vocabulary, or keywords that indicate questions
- The keywords can also be a vocab list of terms you want to learn
- Each keyword must be separated by '|'. The pattern is treated as a regular expression, so '|' means 'or'. I have put some default values in Farsi.
- If you would rather filter the dataset by English words, swap 'sourceLanguage' for 'English'


In [7]:
export_sentence_pairs = raw_sentence_pairs[raw_sentence_pairs['sourceLanguage'].str.contains("گذر|گشت")]
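
If the keywords come from a longer vocab list, the '|' pattern can be built programmatically. This is a minimal sketch on made-up data: build_keyword_pattern is a hypothetical helper, and the toy rows stand in for raw_sentence_pairs. Escaping each keyword keeps characters such as '(' or '?' from being read as regular expression syntax.

```python
import re

import pandas as pd

# Toy stand-in for raw_sentence_pairs (made-up rows, not Tatoeba data).
pairs = pd.DataFrame(
    {
        "sourceLanguage": ["alpha beta", "gamma", "beta (delta)", None],
        "English": ["one", "two", "three", "four"],
    }
)


def build_keyword_pattern(keywords):
    """Join keywords with '|' after escaping regex metacharacters."""
    return "|".join(re.escape(keyword) for keyword in keywords)


keywords = ["alpha", "(delta)"]  # hypothetical vocab list
pattern = build_keyword_pattern(keywords)

# na=False counts missing sentences as non-matches, so the mask stays boolean.
mask = pairs["sourceLanguage"].str.contains(pattern, na=False)
matched = pairs[mask]
print(matched)
```

The same mask can be applied to the 'English' column instead if you prefer to filter by English words.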

Finding sentences by the right complexity
- A simple heuristic is used to estimate complexity: the length of the sentence in characters. The assumption is that the shorter a sentence is, the more likely it is to be useful to a learner at an elementary level, and the longer it is, the more likely it is to be useful to a more advanced learner.
- For example, an advanced language learner can exclude simple sentences like 'Good afternoon.'
- This is not an optimal measure of complexity, but it will help any language learner narrow down the original dataset to a more manageable and useful size.

In [8]:
# create a new column containing the length of each sentence in characters (by default in English, but you can switch to the 'sourceLanguage' column for your target language).
raw_sentence_pairs['sentence_length'] = raw_sentence_pairs['English'].str.len()

In [9]:
# mean-centre the sentence lengths (subtract the average length from each one)
average_sentence_length = raw_sentence_pairs['sentence_length'].mean()
raw_sentence_pairs['sentence_length_mean_centered'] = raw_sentence_pairs['sentence_length'] - average_sentence_length

In [10]:
# bin the sentences into quantiles by length. Here, 5 quantiles have been chosen, but this is flexible. If you change the number, supply the same number of labels.
raw_sentence_pairs['quantiles'] = pd.qcut(raw_sentence_pairs['sentence_length_mean_centered'], 5, labels=["most_simple", "simple", "medium", "complex", "most_complex"])
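
As a quick check of how the binning behaves, here is a minimal sketch on made-up sentences of 1 to 10 characters (the toy dataframe is an assumption, not Tatoeba data). It also compares binning the centred lengths with binning the raw lengths: subtracting the mean shifts every value by the same amount, so the two should give the same bins.

```python
import pandas as pd

# Made-up sentences of 1 to 10 characters, standing in for the English column.
toy = pd.DataFrame({"English": ["x" * n for n in range(1, 11)]})
toy["sentence_length"] = toy["English"].str.len()
toy["sentence_length_mean_centered"] = (
    toy["sentence_length"] - toy["sentence_length"].mean()
)

labels = ["most_simple", "simple", "medium", "complex", "most_complex"]

# Bin the centred lengths, as in the cell above.
toy["quantiles"] = pd.qcut(toy["sentence_length_mean_centered"], 5, labels=labels)

# Bin the raw lengths for comparison.
raw_bins = pd.qcut(toy["sentence_length"], 5, labels=labels)

same_bins = toy["quantiles"].astype(str).tolist() == raw_bins.astype(str).tolist()
print(toy[["sentence_length", "quantiles"]])
print("Same bins with and without centring:", same_bins)
```

On real data, many sentences share the same length, so the bins will not be exactly equal in size.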

In [11]:
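# isolate the simple and complex sentences. str.match anchors the pattern at the start of the label, so 'most_simple' and 'most_complex' are not matched.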
raw_sentence_pairs[raw_sentence_pairs['quantiles'].str.match("simple|complex")]

Id,sourceLanguage,English,sentence_length,sentence_length_mean_centered,quantiles
401635,خیلی ها از تبلیغات فریب خوردند.,Many people were deceived by the advertisement.,47,10.530642,complex
401637,جمعیت زیادی از مردم در جشن حاضر بودند.,A crowd of people were present at a party.,42,5.530642,complex
401977,من و او(زن) هم سن هستیم.,She and I are the same age.,27,-9.469358,simple
401981,او وانمود کرد که مرا نمی شنود.,He pretended not to hear me.,28,-8.469358,simple
402865,دکتر خوبی کسی است که به راهنمایی های خودش عمل ...,A good doctor follows his own directions.,41,4.530642,complex
...,...,...,...,...,...
10724347,من به جادو معتقد نیستم.,I don't believe in magic.,25,-11.469358,simple
10728767,او هنگام تلفن همیشه مجمل بود.,He was always terse on the telephone.,37,0.530642,complex
10728863,شما یک سالوس دورو هستید.,You're a two-faced hypocrite.,29,-7.469358,simple
10730038,دکترها گلوله را در آوردند.,Doctors removed the bullet.,27,-9.469358,simple
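
Because some labels contain others ('most_simple' contains 'simple'), the way the filter is written changes which rows are selected. The sketch below compares three options on a plain Series of the five labels; it uses a toy Series rather than the categorical 'quantiles' column.

```python
import pandas as pd

labels = pd.Series(["most_simple", "simple", "medium", "complex", "most_complex"])

# str.match only matches at the start of each label.
starts_with = labels[labels.str.match("simple|complex")].tolist()

# str.contains matches anywhere in the label, so 'most_simple' is included.
anywhere = labels[labels.str.contains("simple|medium")].tolist()

# isin compares whole labels and ignores partial matches.
exact = labels[labels.isin(["simple", "medium"])].tolist()

print("str.match:   ", starts_with)
print("str.contains:", anywhere)
print("isin:        ", exact)
```

When you want specific quantile labels and nothing else, isin is the least surprising choice.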


In [12]:
# optional - see the distribution of the quantiles before deciding which you want
raw_sentence_pairs['quantiles'].value_counts(ascending=False)

most_simple     1173
simple          1093
complex         1093
most_complex    1080
medium          1060
Name: quantiles, dtype: int64

In [13]:
# add the sentences of the complexity levels you are interested in to a new dataframe, which will later be merged into the export dataframe.
# isin matches the labels exactly; str.contains("simple|medium") would also select 'most_simple'.
additions_by_complexity = raw_sentence_pairs[raw_sentence_pairs['quantiles'].isin(["simple", "medium"])].copy()

In [None]:
# remove the helper columns before merging (reassign the result rather than modifying the filtered selection in place)
additions_by_complexity = additions_by_complexity.drop(columns=['sentence_length', 'sentence_length_mean_centered', 'quantiles'])

In [16]:
# merge the new additions into the export dataframe. ignore_index=True replaces the Id index with a fresh 0..n-1 index.
export_sentence_pairs = pd.concat([export_sentence_pairs, additions_by_complexity],ignore_index=True)

In [17]:
# remove duplicates, i.e. sentences picked up by both the keyword filter and the complexity filter
export_sentence_pairs.drop_duplicates(subset=['English'], inplace=True)
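
To see what the merge and de-duplication steps do, here is a minimal sketch with two made-up dataframes that share one sentence. The names by_keyword and by_complexity and all rows are hypothetical stand-ins for the keyword and complexity selections.

```python
import pandas as pd

# Hypothetical keyword selection, indexed by sentence Id.
by_keyword = pd.DataFrame(
    {"sourceLanguage": ["s1", "s2"], "English": ["Hello.", "Thanks."]},
    index=[10, 20],
)

# Hypothetical complexity selection; 'Thanks.' appears in both.
by_complexity = pd.DataFrame(
    {"sourceLanguage": ["s2", "s3"], "English": ["Thanks.", "Goodbye."]},
    index=[20, 30],
)

# ignore_index=True discards the Ids and renumbers the rows from 0.
combined = pd.concat([by_keyword, by_complexity], ignore_index=True)

# Keep the first occurrence of each English sentence.
deduped = combined.drop_duplicates(subset=["English"])
print(deduped)
```

The original Ids are not carried over, which is fine here because the index is left out of the export anyway.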

In [None]:
# The last step is to export the final dataframe to a .csv file.
# The default encoding is utf-8, which is what Anki requires.
# header=False ensures the column names won't be added as a card in Anki too, and index=False leaves out the row index.
# Finally, to get this into Anki, go to 'import file' in the Anki application and select 'anki_import.csv'.
export_sentence_pairs.to_csv('anki_import.csv', index=False, header=False)
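
Before importing, it can help to check what the exported text looks like. This is a minimal sketch that writes a made-up dataframe to an in-memory buffer instead of a file and reads it back with the standard csv module. The cards dataframe is hypothetical.

```python
import csv
import io

import pandas as pd

# Made-up cards; the first English sentence contains a comma.
cards = pd.DataFrame(
    {"sourceLanguage": ["s1", "s2"], "English": ["Hello, world.", "Thanks."]}
)

# Write to an in-memory buffer with the same options as the export cell.
buffer = io.StringIO()
cards.to_csv(buffer, index=False, header=False)
exported_text = buffer.getvalue()

# Read the text back to confirm there is one row per card and no header row.
rows = list(csv.reader(io.StringIO(exported_text)))
print(exported_text)
print(rows)
```

Sentences that contain a comma are wrapped in double quotes on export, so each row should still parse back into exactly two fields.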