# Customized language learning content
What if, instead of learning words in a random order or by frequency of use, you could feed your text messages, WhatsApp conversations, or emails into an AI and get personalised sentences that match exactly the way you *already speak in your mother tongue*?

Here is how I plan to make it:
1. extract sentences from social app conversations of the user
2. cluster them
3. look for the biggest clusters (i.e. most commonly expressed ideas)
4. for the sentences in those clusters, either mine parallel sentences from a large corpus or machine-translate them (risky)
5. output the translation pairs to a handy format for the user (spreadsheet, CSV, HTML, Anki file...)
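
Step 5 could look roughly like the sketch below. It assumes the clusters are already available as lists of (English, Japanese) pairs; the `clusters` dictionary, the grouping, and the `write_pairs` helper are made up for illustration. The output is tab-separated text, a format Anki and spreadsheets can import.

```python
import csv
import io

# Hypothetical clusters of (English, Japanese) pairs, keyed by cluster label.
clusters = {
    0: [("Go.", "行け。"), ("Go.", "行きなさい。")],
    1: [("Hi.", "こんにちは。"), ("Hi.", "もしもし。"), ("Hi.", "やっほー。")],
}

def write_pairs(clusters, handle):
    """Write one translation pair per row, biggest cluster first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    ordered = sorted(clusters.values(), key=len, reverse=True)
    for pairs in ordered:
        for english, japanese in pairs:
            # Target language first (front of the card), source language second.
            writer.writerow([japanese, english])

# StringIO stands in for a real file such as open("cards.tsv", "w", encoding="utf-8", newline="").
buffer = io.StringIO()
write_pairs(clusters, buffer)
print(buffer.getvalue())
```

Putting the largest cluster first means the most commonly expressed ideas end up at the top of the deck.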

Here is what this notebook does:
1. use the SentenceTransformer package to create sentence embeddings
2. apply the embedder to a large corpus of parallel sentences (translated pairs)
3. use sklearn's AgglomerativeClustering to group sentence embeddings by similar topic / idea / meaning. The algorithm largely does its own thing: I can't tell it how to group the sentences, and I'd like it to focus more on the verb than on the subject. I should try tuning the distance threshold and trying different metrics.
4. store the clusters and the translations of the sentences in a dataframe
5. order the dataframe by descending cluster size
6. export the dataframe as CSV
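
To get a feel for what the distance threshold in step 3 controls, here is a from-scratch toy version of agglomerative clustering with cosine distance and average linkage. It is not sklearn's implementation, only a sketch of the same idea: keep merging the two closest clusters until the closest pair is at or above the threshold. The vectors and function names are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(vectors, distance_threshold):
    """Merge the two closest clusters until none are closer than the threshold."""
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per vector
    while len(clusters) > 1:
        best = None  # (average distance, position of first cluster, position of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_distances = [cosine_distance(vectors[a], vectors[b])
                                  for a in clusters[i] for b in clusters[j]]
                average = sum(pair_distances) / len(pair_distances)
                if best is None or average < best[0]:
                    best = (average, i, j)
        if best[0] >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up 2-D vectors standing in for sentence embeddings.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]

tight = average_linkage(toy_vectors, distance_threshold=0.4)
loose = average_linkage(toy_vectors, distance_threshold=1.0)
print(tight)
print(loose)
```

With the tighter threshold the toy vectors stay in three groups; with the looser one the first four vectors collapse into a single cluster. The same trade-off applies to real embeddings: a higher threshold gives fewer, broader clusters.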

Caveats:
* The Tatoeba corpus is full of sentences starring "Tom", "John" and "Mary". In the Japanese sentences, I replaced these names with 彼 (kare, "he") and 彼女 (kanojo, "she"). Otherwise, sentences would be clustered together because they share a name, regardless of what is being said.
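
A minimal sketch of that substitution, using only the standard library. The example sentences and the `neutralize_names` helper are made up; the names, pronouns, and the rule of skipping names followed by さん or くん mirror the loading cell further down.

```python
import re

# Names to neutralize, mapped to the pronoun that replaces them.
NAME_TO_PRONOUN = {
    "トム|ジョン": "彼",   # kare (he)
    "メアリー": "彼女",    # kanojo (she)
}

def neutralize_names(sentence):
    """Replace known names with pronouns, unless an honorific follows."""
    for names, pronoun in NAME_TO_PRONOUN.items():
        # (?:...) groups the alternatives so the negative lookahead
        # (?!さん|くん) applies to every name, not only the last one.
        pattern = "(?:" + names + ")(?!さん|くん)"
        sentence = re.sub(pattern, pronoun, sentence)
    return sentence

examples = ["トムは走った。", "メアリーは走った。", "ジョンくんは走った。"]
cleaned = [neutralize_names(s) for s in examples]
print(cleaned)
```

Names followed by an honorific are left alone, since swapping in a pronoun there would produce unnatural Japanese.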

# Agglomerative sentence clustering

In [1]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.0.0.tar.gz (85 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.0.0-py3-none-any.whl size=126709 sha256=b974f51a7361f5250d1f9c646dac3fb08d39c8c5b4c039c9dea004d827646de3
  Stored in directory: /root/.cache/pip/wheels/d1/c1/0f/faafd427f705c4b012274ba60d9a91d75830306811e1355293
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.0.0


# Clustering sentences

https://www.sbert.net/examples/applications/clustering/README.html

Original code: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/agglomerative.py

In [2]:
"""
This is a simple application for sentence embeddings: clustering
Sentences are mapped to sentence embeddings and then agglomerative clustering with a threshold is applied.
"""
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def cluster(corpus):
    corpus_embeddings = embedder.encode(corpus)

    # Normalize the embeddings to unit length
    corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

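    # Perform agglomerative clustering; distance_threshold decides how many clusters come out
    clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1.5) # alternative to try: affinity='cosine' (named `metric` in newer scikit-learn), linkage='average', distance_threshold=0.4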
    clustering_model.fit(corpus_embeddings)
    cluster_assignment = clustering_model.labels_ # assigned label of every sentence

    # Create a dict mapping each cluster ID to the list of its sentences
    clustered_sentences = {}
    for sentence_id, cluster_id in enumerate(cluster_assignment):
        # Create a List the first time the ID is encountered...
        if cluster_id not in clustered_sentences:
            clustered_sentences[cluster_id] = []
        # ...before appending content into it.
        clustered_sentences[cluster_id].append(corpus[sentence_id])

    # Loop variable is named `sentences` to avoid shadowing the cluster() function
    for i, sentences in clustered_sentences.items():
        print("Cluster ", i+1)
        print(sentences)
        print("")

    print("Biggest clusters:")
    biggest_clusters = sorted(clustered_sentences.values(), key=len, reverse=True)
    for sentences in biggest_clusters:
        print(sentences)
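
Why normalize the embeddings first? As far as I understand, AgglomerativeClustering defaults to Ward linkage on Euclidean distances, and for unit-length vectors the squared Euclidean distance is a simple function of cosine similarity: ||a - b||² = 2 - 2·cos(a, b). The sketch below checks this on made-up vectors. Note that the threshold applies to the linkage distance, so it is not literally a cosine cut-off.

```python
import numpy as np

# Three made-up 2-D "embeddings" of different lengths (illustrative only).
embeddings = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 2.0]])

# Same normalization as in cluster(): divide each row by its L2 norm.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just their dot product.
cosine_sim = unit @ unit.T

# Squared Euclidean distance between every pair of unit vectors.
diff = unit[:, None, :] - unit[None, :, :]
sq_euclidean = (diff ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_euclidean, 2 - 2 * cosine_sim)
print(identity_holds)
```

The first two rows point in the same direction, so after normalization their distance is zero even though their original lengths differ.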

In [3]:
# Corpus with example sentences
EN_small_corpus = ['A man is eating food.',
                   'A man is eating a piece of bread.',
                   'A man is eating pasta.',
                   'The girl is carrying a baby.',
                   'The baby is carried by the woman',
                   'A man is riding a horse.',
                   'A man is riding a white horse on an enclosed ground.',
                   'A monkey is playing drums.',
                   'Someone in a gorilla costume is playing a set of drums.',
                   'A cheetah is running behind its prey.',
                   'A cheetah chases prey on across a field.'
                   ]

cluster(EN_small_corpus)

Cluster  1
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']

Cluster  3
['The girl is carrying a baby.', 'The baby is carried by the woman']

Cluster  2
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']

Cluster  4
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']

Cluster  5
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']

Biggest clusters:
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']
['The girl is carrying a baby.', 'The baby is carried by the woman']
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']


In [4]:
# Corpus with example sentences
FR_small_corpus = ['Un homme mange du fromage.',
                   'Cet homme fait du cheval depuis 6 ans.',
                   'Les enfants mangent avec leurs parents.',
                   'Il fait du cheval depuis 6 mois.',
                   'Comment ça va ?',
                   'Une femme mange du fromage.',
                   'Cette femme attend un enfant.',
                   'Elle appréhende le jour de l\'accouchement...',
                   'Tu vas bien ?',
                   'Comment tu te sens ?',
                   'Ça va ?.',
                   ]
cluster(FR_small_corpus)

Cluster  1
['Un homme mange du fromage.', 'Les enfants mangent avec leurs parents.', 'Une femme mange du fromage.']

Cluster  3
['Cet homme fait du cheval depuis 6 ans.', 'Il fait du cheval depuis 6 mois.']

Cluster  4
['Comment ça va ?', 'Tu vas bien ?', 'Comment tu te sens ?', 'Ça va ?.']

Cluster  2
['Cette femme attend un enfant.', "Elle appréhende le jour de l'accouchement..."]

Biggest clusters:
['Comment ça va ?', 'Tu vas bien ?', 'Comment tu te sens ?', 'Ça va ?.']
['Un homme mange du fromage.', 'Les enfants mangent avec leurs parents.', 'Une femme mange du fromage.']
['Cet homme fait du cheval depuis 6 ans.', 'Il fait du cheval depuis 6 mois.']
['Cette femme attend un enfant.', "Elle appréhende le jour de l'accouchement..."]


## Testing on English-to-Japanese corpus

In [5]:
import pandas as pd
import re
filename = "/kaggle/input/english-to-japanese-50k-sentences/jpn.txt"
df = pd.read_csv(filename,sep='\t',header=None,names=['EN','JP','Metadata'])
df = df.drop(columns='Metadata')
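for name,pronoun in {'トム|ジョン':'彼','メアリー':'彼女'}.items():
    # Replace names with pronouns unless an honorific follows (negative lookahead).
    # The (?:...) group makes the lookahead apply to every alternative, not only the last one.
    df['JP'] = df['JP'].apply(lambda x: re.sub('(?:'+name+')(?!さん|くん)',pronoun,x))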
df.head()

,EN,JP
0,Go.,行け。
1,Go.,行きなさい。
2,Hi.,こんにちは。
3,Hi.,もしもし。
4,Hi.,やっほー。


In [6]:
# df_sample = df.sample(20000) # random
df_sample = df[0:100].copy()
JP_subcorpus = list(df_sample['JP'])

print(JP_subcorpus[:200]) # show at most the first 200 sentences (this sample only has 100)

['行け。', '行きなさい。', 'こんにちは。', 'もしもし。', 'やっほー。', 'こんにちは！', '走れ。', '走って！', '誰？', 'すごい！', 'ワォ！', 'わぉ！', 'おー！', '火事だ！', '火事！', '撃て！', '助けて！', '助けてくれ！', '飛び越えろ！', '跳べ！', '飛び降りろ！', '飛び跳ねて！', 'ジャンプして！', '跳べ！', '飛び跳ねて！', 'ジャンプして！', 'やめろ！', '止まれ！', '待って！', '続けて。', '進んで。', '進め。', '続けろ。', 'こんにちは。', 'もしもし。', 'こんにちは！', '急げ！', 'なるほど。', 'なるほどね。', 'わかった。', 'わかりました。', 'そうですか。', 'そうなんだ。', 'そっか。', '頑張ってみる。', 'やってみる。', '試してみる。', 'やってみよう！', 'トライしてみる。', '俺の勝ちー！', '勝ったぁ！', '勝ったぞ！', '私の勝ち！', '私が勝ち！', 'なんてこった！', 'なんてことだ！', 'しまった！', 'あー、しまった！', 'うわ、しまった！', '何てことだ！', '落ち着いて。', 'くつろいで。', 'リラックスして。', '楽にしてください。', '撃て！', 'はい、チーズ。', 'にっこり笑って。', '乾杯！', '動くな！', '起きなさい！', '起きなさい。', '起きろ！', 'さあ、行っといで。', '捕まえた。', '分かった！', '彼は走った。', '彼が走った。', '乗れよ。', 'さあ乗って。', '抱きしめて。', 'ぎゅーして。', '分かってる。', '分かってます。', '出発した', '負けた・・・。', '払いました。', '私、辞めます。', 'やめた。', '１９歳です。', '大丈夫ですよ。', '私は大丈夫です。', '起きてるよ。', '聞きなさい。', '聞いて！', '馬鹿な！', 'あり得ねぇー。', 'とんでもない！', 'とんでもございません！', 'とんでもありません！', '本当？']


In [7]:
# cluster(JP_subcorpus)

In [8]:
JP_corpus_embeddings = embedder.encode(JP_subcorpus)

# Normalize the embeddings to unit length
JP_corpus_embeddings = JP_corpus_embeddings /  np.linalg.norm(JP_corpus_embeddings, axis=1, keepdims=True)

# Perform agglomerative clustering with a distance threshold
clustering_model = AgglomerativeClustering(n_clusters=None, distance_threshold=1) #, affinity='cosine', linkage='average', distance_threshold=0.4)
clustering_model.fit(JP_corpus_embeddings)
cluster_assignment = clustering_model.labels_ # assigned label of every sentence

In [9]:
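# Assign by position; wrapping the labels in pd.Series would align on the index instead,
# which only works while df_sample has a 0..n-1 index (not with df.sample(...))
df_sample['Labels'] = cluster_assignment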
# df_sample.head(10)
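
One thing to watch when attaching labels to a dataframe: a `pd.Series` is matched on its index, not on row position. That makes no difference for `df[0:100]`, whose index runs from 0 to 99, but it would silently misplace labels for a random sample like the commented-out `df.sample(20000)`. A small sketch with made-up rows and labels:

```python
import pandas as pd

# Made-up rows with a shuffled index, as df.sample(...) would produce.
toy = pd.DataFrame({"JP": ["行け。", "こんにちは。", "走れ。"]}, index=[5, 2, 9])
labels = [0, 1, 0]  # hypothetical cluster labels, one per row, in row order

# Wrapping in pd.Series aligns on index: only index 2 exists on both sides.
misaligned = toy.assign(Labels=pd.Series(labels))

# A plain list (or NumPy array) is assigned by position instead.
aligned = toy.assign(Labels=labels)

print(misaligned)
print(aligned)
```

In the misaligned version two rows end up without a label and the third gets a label that belongs to a different row.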

In [10]:
aggregated = df_sample.groupby('Labels').aggregate(set)
aggregated['Size'] = aggregated['JP'].apply(len)
aggregated = aggregated.sort_values('Size',ascending=False)
aggregated

Labels,EN,JP,Size
2,"{I left., Go now., Go., Go on., Hop in.}","{さあ乗って。, 行け。, 出発した, 続けろ。, 乗れよ。, さあ、行っといで。, 行きな...",10
5,"{No way!, Oh no!}","{あー、しまった！, うわ、しまった！, なんてことだ！, 何てことだ！, しまった！, な...",7
3,"{Hurry!, Hug me., Cheers!, Wait!, Help!}","{助けてくれ！, 乾杯！, 急げ！, ぎゅーして。, 助けて！, 待って！}",6
15,"{Jump!, Jump.}","{ジャンプして！, 飛び跳ねて！, 飛び降りろ！, 跳べ！, 飛び越えろ！}",5
6,"{Got it!, I'm OK., I see.}","{わかった。, 分かった！, 私は大丈夫です。, 大丈夫ですよ。, わかりました。}",5
8,{I try.},"{試してみる。, やってみる。, やってみよう！, トライしてみる。, 頑張ってみる。}",5
10,{I won!},"{俺の勝ちー！, 私が勝ち！, 勝ったぞ！, 私の勝ち！, 勝ったぁ！}",5
4,"{I'm up., Get up.}","{起きろ！, 起きてるよ。, 起きなさい！, 起きなさい。}",4
7,{No way!},"{とんでもございません！, とんでもありません！, あり得ねぇー。, とんでもない！}",4
9,"{Hello!, Hi.}","{もしもし。, やっほー。, こんにちは。, こんにちは！}",4


In [11]:
aggregated.to_csv(f'SentenceTransformers_Clustering_EN_JP_{len(df_sample)}.csv')