# Using centroid-based clustering to group languages by similarity
In this project I will try to group languages by similarity.

I will be using a family of clustering algorithms built around "centroids". Initially each centroid is placed at the position of a single data sample, but while fitting the model we'll repeatedly remove the centroid to which the fewest languages belong.

The training cycle will be:

1. Move every centroid to the average position of samples belonging to it (stop when there's no improvement)
2. Remove the smallest centroid
3. Repeat until there are only a select number of centroids left
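
The three steps above can be sketched in plain NumPy. This is a minimal sketch and not the actual `centroidspace` implementation: the function names `assign` and `reduction_fit`, the use of Euclidean distance, and the convergence check are all assumptions made for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Return the index of the nearest centroid for every sample (Euclidean, assumed)."""
    # distances[i, j] = distance from sample i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def reduction_fit(X, min_centroids, max_epochs=50):
    """Hypothetical sketch of the training cycle: fit, drop the smallest, repeat."""
    X = np.asarray(X, dtype=float)
    centroids = X.copy()  # start with one centroid on every sample
    while True:
        # Step 1: move centroids to the mean of their members until nothing changes
        for _ in range(max_epochs):
            labels = assign(X, centroids)
            new_centroids = centroids.copy()
            for j in range(len(centroids)):
                members = X[labels == j]
                if len(members) > 0:
                    new_centroids[j] = members.mean(axis=0)
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        # Step 3: stop once only the requested number of centroids is left
        if len(centroids) <= min_centroids:
            return centroids, assign(X, centroids)
        # Step 2: remove the centroid with the fewest members
        counts = np.bincount(assign(X, centroids), minlength=len(centroids))
        centroids = np.delete(centroids, counts.argmin(), axis=0)
```

On a toy set of three well-separated groups of points, `reduction_fit(points, min_centroids=3)` should end with one centroid per group.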

The data I'll be using for this project was given to me by a friend during a workshop. It consists of .txt files with text from Wikipedia articles in different languages.


## About clustering
Clustering is a type of machine learning in which the meaning of the data is found in the relationships between the data samples. This information can be exploited by grouping similar data in different ways, and then looking at the collective information in each such group.

If you don't have labels for your data this is still a great way to explore it. You may learn more about certain parts of your data by looking at how it relates to other parts of your data, and how strong these connections are.

There are many different types of clustering algorithms, and they may accomplish different tasks with different accuracies. Two of the main families are hierarchical algorithms, which are either divisive (splitting the data into smaller and smaller groups) or agglomerative (joining data points into larger and larger groups), and partitional algorithms, to which centroid-based approaches such as the one used here belong.

### Let's begin by importing the necessary stuff
The homemade code for this project will be the "centroidspace" class, and the load_languages function.

In [1]:
%matplotlib inline

# Homemade stuff
from clustering.centroidspace import centroidspace
from testing_data.load_languages import load_languages

# Other stuff
import numpy as np
import matplotlib.pyplot as plt

### Loading the data
For this I've created a function to easily load the data from this repo. The "raw_articles" variable will contain a dictionary keyed by the language of every file we load. A list of all these languages is then printed.

In [2]:
# Loading text from articles
raw_articles = load_languages()

# Printing list of languages
keys = list(raw_articles.keys())
for key in keys: print(key)

Czech
Indonesian
German
Waray-Waray
Basque
Esperanto
Swedish
Slovenian
Estonian
Turkish
Finnish
Catalan
Vietnamese
Lithuanian
Malay
Simple English
Uzbek
Danish
French
Galician
Cebuano
Latin
English
Minangkabau
Hungarian
Romanian
Spanish
Italian
Portuguese
Dutch
Croatian
Slovak
Polish
Norwegian (Nynorsk)


### Figuring out a way to find features in our data
The data we have now loaded is contained purely in strings, something that typically won't go well with machine learning algorithms, and thus we'll have to convert it to numbers somehow. We want the features of our data to remain intact, and these features should not depend on the length of the strings we present to our model.

There are probably a bunch of ways to do this, but what I want to do here is to check the frequency with which each pair of letters appears in a language. This approach probably won't allow us to find differences between highly similar languages, but it should suffice for this project.

In the following function I do just that. It's given a string and loops through it to count occurrences of every pair of characters drawn from the 26 letters and the space, which gives 27 × 27 = 729 features. The counts are then converted to an array and divided by the total number of observations to get a relative frequency for each pair. Thanks to this division, our model should work no matter how long the given text samples are.

In [3]:
# Function to calculate frequency of character-pair occurrences
def calculate_frequency(text, length=10000):
    # Cleaning up text
    text = text.replace('\n', ' ')[:length].lower()
    # Setting up containers
    letters = 'abcdefghijklmnopqrstuvwxyz '
    pairs = {}
    for let1 in letters:
        for let2 in letters:
            pairs[let1+let2] = 0
    # Calculating...
    total = 0
    for i in range(2, len(text) + 1):  # + 1 so the final pair is counted too
        pair = text[i-2:i]
        if pair in pairs:
            pairs[pair] += 1
            total += 1
    # Returning frequencies
    return np.divide([pairs[key] for key in pairs], total)

In [4]:
# Calculating pair frequencies for all languages
X = [calculate_frequency(raw_articles[key], 100000) for key in raw_articles]
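
To check that dividing by the total really makes the features roughly independent of text length, here is a small self-contained sketch. `pair_frequencies` is a hypothetical, compact restatement of the same idea (not the notebook's own function), and the sample sentence is only an illustration.

```python
import numpy as np
from collections import Counter

LETTERS = 'abcdefghijklmnopqrstuvwxyz '
PAIRS = [a + b for a in LETTERS for b in LETTERS]  # 27 * 27 = 729 features

def pair_frequencies(text):
    """Relative frequency of each adjacent character pair (hypothetical helper)."""
    text = text.replace('\n', ' ').lower()
    # Count every adjacent pair; pairs with unknown characters are ignored below
    counts = Counter(a + b for a, b in zip(text, text[1:]))
    vector = np.array([counts[pair] for pair in PAIRS], dtype=float)
    total = vector.sum()
    # Avoid dividing by zero when the text contains no known pairs
    return vector / total if total > 0 else vector

short_text = 'the quick brown fox jumps over the lazy dog '
long_text = short_text * 50  # same content, 50 times longer

v_short = pair_frequencies(short_text)
v_long = pair_frequencies(long_text)

print(v_short.shape)                     # (729,)
print(v_short.sum(), v_long.sum())       # both approximately 1
print(np.linalg.norm(v_short - v_long))  # small compared with the vectors' own length
```

The two vectors are not identical, because repeating the sentence adds a few extra pairs where the copies meet, but they stay close even though one text is 50 times longer than the other.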

### Creating the model
The homemade class for this model is the centroidspace class, which we've already imported. This class utilizes a smaller class called "centroid" to handle clusters and their content in a somewhat efficient manner. Since we only have 34 languages, I'm going to initialize this model with one centroid at the exact location of every data sample.

In [5]:
# Creating model
model = centroidspace(init_positions=X)

### Fitting the model
As mentioned in the beginning, we will be doing two things here to improve our model. First the model will be fitted to our data for 50 epochs (or until there's no improvement), and then the centroid to which the fewest data points belong will be deleted. The process is repeated until only a select number of centroids remain.

Feel free to change the number of centroids. It's interesting to see how languages are grouped when you do.

In [6]:
# Fitting model
model.reductionfit(X, min_centroids=10)

### Training finished. Time to see the results!
First: Let's look at how the model grouped the languages it was shown during training.

In [7]:
# Testing and grouping by cluster
for i in range(len(keys)):
    model.predict(X[i], keys[i])
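
For readers without the homemade classes at hand, the prediction step used here can be sketched as a nearest-centroid lookup. Everything in this sketch is an assumption: `Centroid` and `predict` are hypothetical stand-ins, and the real `centroidspace.predict` may measure distance or store labels differently.

```python
import numpy as np

class Centroid:
    """Hypothetical stand-in for the notebook's centroid class."""
    def __init__(self, position):
        self.position = np.asarray(position, dtype=float)
        self.labels = []  # names of the samples assigned to this centroid

def predict(centroids, sample, label=None):
    """Return the centroid closest to `sample`; optionally record a label on it."""
    sample = np.asarray(sample, dtype=float)
    # Euclidean distance from the sample to every centroid (assumed metric)
    distances = [np.linalg.norm(sample - c.position) for c in centroids]
    closest = centroids[int(np.argmin(distances))]
    if label is not None:
        closest.labels.append(label)
    return closest

centroids = [Centroid([0.0, 0.0]), Centroid([1.0, 1.0])]
predict(centroids, [0.1, 0.2], 'A')  # lands on the first centroid
predict(centroids, [0.9, 0.8], 'B')  # lands on the second centroid
print(predict(centroids, [0.2, 0.0]).labels)  # ['A']
```

Calling `predict` with a label mirrors how the training languages are grouped in this cell, while calling it without one only looks up the closest cluster.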

In [8]:
# Printing cluster contents
for i in range(len(model.centroids)):
    print('\nCluster %s contents:' % i)
    for label in model.centroids[i].labels: print(label)


Cluster 0 contents:
Basque
Turkish
Uzbek
Hungarian

Cluster 1 contents:
Estonian
Finnish
Lithuanian

Cluster 2 contents:
Indonesian
Malay
Minangkabau

Cluster 3 contents:
Waray-Waray
Vietnamese
Cebuano

Cluster 4 contents:
Catalan
French
Spanish

Cluster 5 contents:
Esperanto
Galician
Latin
Romanian
Italian
Portuguese

Cluster 6 contents:
German
Dutch

Cluster 7 contents:
Slovenian
Croatian

Cluster 8 contents:
Czech
Slovak
Polish

Cluster 9 contents:
Swedish
Simple English
Danish
English
Norwegian (Nynorsk)


This looks cool! Our model is able to group all these languages in a way that seems to make sense! I found it interesting that French was put in a cluster with Spanish and Catalan. I thought it would be closer to English or German.

### Testing with strings it hasn't seen yet
Let's make sure our model actually figured something out.

In [9]:
# Some testing strings
test_lib = {'english': 'everything gets better with red wine',
            'italian': 'tutto migliora con il vino rosso',
            'malay':   'semuanya menjadi lebih baik dengan wain merah',
            'idiot':   'covfefe'}

# Running predictions and printing
for key in test_lib:
    sample = test_lib[key]
    pos = calculate_frequency(sample)
    closest_centroid = model.predict(pos)
    print('\n"%s" is %s and the closest centroid also contains' % (sample, key))
    for label in closest_centroid.labels: print('- %s' % label)


"everything gets better with red wine" is english and the closest centroid also contains
- Swedish
- Simple English
- Danish
- English
- Norwegian (Nynorsk)

"tutto migliora con il vino rosso" is italian and the closest centroid also contains
- Esperanto
- Galician
- Latin
- Romanian
- Italian
- Portuguese

"semuanya menjadi lebih baik dengan wain merah" is malay and the closest centroid also contains
- Indonesian
- Malay
- Minangkabau

"covfefe" is idiot and the closest centroid also contains
- Czech
- Slovak
- Polish


Great! Everything seems to work pretty well!

Please play around with the number of centroids, test other strings, or write your own versions of this code. I invite anyone who sees this to comment or contribute to this repo.

Thanks for reading, and happy hacking!