Hi!
Today we will play with German words!

A computer can't understand words the way we do; it can only work with numbers. So we have to translate words into lists of numbers (vectors).

There are a lot of ways to do this. A simple one is just to number the words: 'der' -> 1, 'die' -> 2, etc. But there are better ways. One of them is to give similar words similar numbers.

Then we can group similar words together, see how close they are, and other interesting things. 
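
Here is a tiny sketch of both ways of turning words into numbers. The 2D vectors in it are made up by hand for illustration; they are not taken from the real model used in this notebook, which stores many more numbers per word.

```python
import math

words = ['der', 'die', 'hund', 'katze']

# Way 1: just number the words. The numbers say nothing about meaning.
word_to_id = {word: i + 1 for i, word in enumerate(words)}
print(word_to_id)

# Way 2: give similar words similar numbers.
# These vectors are invented for this example only.
toy_vectors = {
    'hund': (0.9, 0.8),
    'katze': (0.8, 0.9),
    'der': (-0.7, 0.1),
}

def distance(a, b):
    # Straight-line distance between two points
    return math.dist(a, b)

# 'hund' and 'katze' are close together, 'hund' and 'der' are far apart
print(distance(toy_vectors['hund'], toy_vectors['katze']))
print(distance(toy_vectors['hund'], toy_vectors['der']))
```

With plain numbering, 'der' and 'die' end up next to each other only because of the order of the list. With vectors, closeness can carry meaning.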

This is used, for example, when you Google something: the results can also contain words similar to the one you typed ('doggy' -> 'dog', etc.).

Today you will:
- Look at some German words and how close they are to each other
- Add your own words!
- See if there are groups of similar words, and give them names

If you think this is interesting and want more:
- Find the 10 most similar words to a given word
- See how similar two words are

This notebook is made of "cells". To run a cell, press Shift+Enter.
You can run a cell as many times as you need.



In [None]:
#nothing to do here, just execute (Shift + Enter)
!pip install wget

print('Erfolg!')

In [None]:
#nothing to do here, just execute (Shift + Enter)

import pandas as pd
import ast
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import pickle
import wget
import os

print('Erfolg!')

In [None]:
#nothing to do here, just execute (Shift + Enter)
#running this cell can take some time (~1-2 min)

import requests

url_vectors = 'https://int-emb-word2vec-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt'
vectors_filename = 'vectors.txt'
if not os.path.exists(vectors_filename):
    r = requests.get(url_vectors, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed
    with open(vectors_filename, 'wb') as f:
        f.write(r.content)

url_model = 'https://serhii.net/s/german.model'
model_filename = 'german.model'
if not os.path.exists(model_filename):
    r = requests.get(url_model, allow_redirects=True)
    r.raise_for_status()  # stop here if the download failed
    with open(model_filename, 'wb') as f:
        f.write(r.content)

print('Erfolg!')

In [None]:
#nothing to do here, just execute (Shift + Enter)

x = pd.read_csv(vectors_filename, 
                sep=' ', 
                nrows=10000, 
                header=None, 
                converters={0: lambda s: ast.literal_eval(s).decode('utf-8')})
print('Erfolg!')

In [None]:
#nothing to do here, just execute (Shift + Enter)

X = np.array(x)
X_vectors = X[:, 1:]
X_words = X[:, 0]
print('Erfolg!')

In [None]:
#nothing to do here, just execute (Shift + Enter)
#running this cell can take some time (~1 min)

X_2d = TSNE(n_components=2).fit_transform(X_vectors)
print('Erfolg!')

## Playground №1. Choose your words 
You can:
- do nothing, and run everything just like it is
- add words you like (in quotes, like `'Freitag'`); don't forget the comma at the end!
- delete words
- ignore words by _commenting them out_: just add `#` in front of the word (see the example near 'Kathrin')
- add them back by deleting the `#`


In [None]:
words = [
    'Mutter',
    'Vater',
    'der',
    'die',
    'das',
    'Schwester',
    'Straße',
    'Straßenbahn',
    'Haus',
    'Tier',
    #'Kathrin',
    'Christian',
    'Mann',
    'Frau',
    'schwimmen',
    'in',
    'im',
    'Herr',
    'Schule',
    'Osten',
    'Anna',
    'England',
    'Bär',
    'Wahrheit',
    'Deutschland',
    'Idee',
    'spielen',
    'dem',
    'Berlin',
    'Traum',
    'Zeit',
    'Spiegel',
    'Freundschaft',
    'Liebe',
    'Göttin'
]

print('Erfolg!')

In [None]:
#execute (Shift + Enter) and check if all words are available in the dictionary

words = np.array(list(map(lambda w: w.lower(), words)))
ind = np.flatnonzero([word_ in words for word_ in X_words])

if ind.shape[0] != words.shape[0]:
    print('Warning! Check these words: ', set(words) - set(X_words[ind]))
    print('They could be spelled wrong, or not available in the dictionary')

print('Number of words you wrote, and number of them in the dictionary:', len(words), len(ind))

## Playground №2. Group words  
Here you can see the words on the screen. You can see that some words are close together, some aren't.

There's a slider above the plot.
- Try dragging it and see what it does! 
- Find the best number of groups

- See if there are interesting groups of words
- Try giving names to the groups
- Does anything surprise you? Are any words not where you expect them to be?
- Think of a word, try guessing where it'll be on the screen. Then add it and see if it's there!

Try clicking all the buttons in the top-right corner of the plot and see what they do!

You can:
- Zoom in and out
- Click and drag to move, once you're zoomed in


In [None]:
import plotly.express as px
from ipywidgets import interact
import ipywidgets as widgets


def plot_func(groups):
    X_plotly = {
        'words': X_words[ind],
        'x': X_2d[ind, 0],
        'y': X_2d[ind, 1]
    }


    kmeans = KMeans(n_clusters=int(groups), random_state=0).fit(X_2d[ind])
    labels_pred = kmeans.predict(X_2d[ind])
    X_plotly['labels'] = labels_pred

    fig = px.scatter(X_plotly, x="x", y="y", text="words", color='labels')
    fig.update_traces(textposition='top center')
    fig.show()

# The number of groups is a whole number, so an integer slider fits best
interact(plot_func, groups=widgets.IntSlider(value=5,
                                             min=1,
                                             max=10,
                                             step=1))

## Explanation of kmeans and word2vec (not ready)

![1*b2sO2f--yfZiJazc5rYSpg.gif](https://miro.medium.com/max/875/1*b2sO2f--yfZiJazc5rYSpg.gif)

If you want to understand how grouping (K-means) works, you can watch this video: https://www.youtube.com/watch?v=Gn6fPYD1oIU
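
The idea can also be sketched in a few lines of plain Python. This is a simplified version for illustration, not the code that `KMeans` from scikit-learn runs in this notebook: the function `kmeans`, the toy points, and the fixed starting centers are all assumptions made for this example (the real algorithm picks its starting centers by itself).

```python
import math

def kmeans(points, centers, steps=10):
    """Very small K-means. `centers` are the starting guesses for the group centers."""
    for _ in range(steps):
        # Step 1: put every point into the group of its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Step 2: move every center to the average of the points in its group
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # empty group: keep the old center
        centers = new_centers
    return labels, centers

# Two obvious groups of toy points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The two steps repeat until the groups stop changing. Here that happens after the first round, because the toy points are so clearly separated.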


## Playground №3. Find the most similar words
Here you can:
1. See how similar two words are, where 1 means very similar and 0 means not similar at all
2. Get the 10 most similar words for a given word
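
Both tasks can be sketched with cosine similarity, which is the measure gensim's `similarity` uses. The vectors below are made up for illustration, and the helper names `cosine_similarity` and `most_similar` are our own for this example; the real model in this notebook has many more numbers per word.

```python
import math

def cosine_similarity(a, b):
    # Compare the direction of two vectors: 1 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, only for illustration
toy_vectors = {
    'vater': (0.9, 0.1, 0.3),
    'mutter': (0.8, 0.2, 0.4),
    'berlin': (0.1, 0.9, 0.2),
    'deutschland': (0.2, 0.8, 0.3),
}

def most_similar(word, topn=2):
    # Score every other word, then keep the best `topn`
    scores = [
        (other, cosine_similarity(toy_vectors[word], vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(cosine_similarity(toy_vectors['vater'], toy_vectors['mutter']))
print(most_similar('berlin'))
```

In this toy data, 'deutschland' comes out as the word most similar to 'berlin', because their made-up vectors point in almost the same direction.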

In [None]:
#nothing to do here, just execute (Shift + Enter)

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(model_filename, binary=True)
print('Erfolg!')

In [None]:
# Give two words and see how similar they are, where 1 means very similar and 0 means not similar at all

model.similarity('Vater', 'Frau')

In [None]:
# Give a word and get the 10 most similar words

model.most_similar('Deutschland')

## Explanation of similarity (not ready)