# Grouping Texts Experiments

Clustering similar texts based on [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). "Levenshtein distance is a string metric for measuring the difference between two sequences" (Wikipedia). We are going to experiment with a text clustering strategy on two datasets: [Fake Names](http://listofrandomnames.com/) and [Medium Articles](https://www.kaggle.com/hsankesara/medium-articles). By the end, I hope you will be able to run your own text clustering experiments.

-----
<a id="data"></a>
# Data Exploration

Data loading and data samples. I'm creating two datasets to experiment with the clustering:
- Fake Names
- Medium Articles

In [None]:
# standard libraries
import numpy as np
import pandas as pd

## Fake Names

We simply generate a list of 300 fake names using http://listofrandomnames.com/. We also create a short list with only 26 names.

In [None]:
names = ['Tashia Monsen', 'Marcy Sondag', 'Kristine Wool', 'Shantay Cubbage', 'Duncan Albano', 'Dollie Everhart', 'Sherryl Funston', 'Cherly Gooding', 'Elois Lasch', 'Irina Temme', 'Candis Sievert', 'Kris Difilippo', 'Rosana Bocanegra', 'Ernestina Thacker', 'Terrilyn Antonio', 'Maxwell Kin', 'Gilberte Laxton', 'Roberto Pavone', 'Alanna Hervey', 'Orlando Heit', 'Brianna Cutshall', 'Eveline Alvidrez', 'Rick Benes', 'Leann Shimer', 'Josie Witcher', 'Lissette Imburgia', 'Candra Coloma', 'Denis Eldreth', 'Alfred Sue', 'Stevie Brannan', 'Lou Derouin', 'Missy Helman', 'Crissy Mejorado', 'Pamila Villegas', 'Carmella Waren', 'Shondra Kyles', 'Ashely Utley', 'Kenya Bottomley', 'Tiara Ball', 'Elza Starke', 'Linsey Howley', 'Caridad Wensel', 'Armanda Burmeister', 'Rosalba Zuber', 'Briana Eggleton', 'Keli Zelinski', 'Elena Hewlett', 'Asia Richburg', 'Ida Gerena', 'Corrina Weingarten', 'Nery Dewall', 'Aurore Boeke', 'Bok Yaeger', 'Phebe Stotz', 'Debbi Budde', 'Lionel Gartner', 'Danny Tusa', 'Dori Schrimsher', 'Cole Rando', 'Gladys Woolley', 'Micheal Derr', 'Kyoko Bryne', 'Lane Ditty', 'Eileen Klink', 'Antone Sturgeon', 'Chantelle Howerton', 'Siu Hendricks', 'Florrie Sears', 'Maryetta Gutierez', 'Vasiliki Borgmann', 'Eura Lovins', 'Bette Beech', 'Dino Pasko', 'Esther Margulies', 'Crissy Behar', 'Keesha Landau', 'Sean Grainger', 'Gaynell Sease', 'Maryellen Felps', 'Delmer Briles', 'Margurite Depriest', 'Bettye Shaikh', 'Denna Lawton', 'Janet Roark', 'Catrice Ruzicka', 'Marx Sing', 'Billie Shewmaker', 'Darla Mathew', 'Micheline Theisen', 'Rosenda Plum', 'Amee Hippler', 'Johnna Stickel', 'Shirely Tennison', 'Ossie Shadwick', 'Tarra Winton', 'Lincoln Burket', 'Lovie Wiesner', 'Chloe Eyler', 'Olene Groves', 'Ashely Blades', 'Mahalia Breazeale', 'Lavinia Agudelo', 'Lessie Westbrooks', 'Kenda Isenberg', 'Dianne Trumble', 'Elsie Legree', 'Louetta Delucca', 'Hattie Cozad', 'Roderick Kirklin', 'Paola Lagunas', 'Janiece Christain', 'Mayme Shoulders', 'Rolland Oxley', 'Brittani Buttery', 
'Rosalia Difalco', 'Philip Leroux', 'Therese Mroz', 'Georgette Chacon', 'Gaynell Mumm', 'Arminda Flannery', 'Stella Peeples', 'Stevie Hardesty', 'Edyth Glotfelty', 'Jama Gervais', 'Kaye Pariseau', 'Albertha Furby', 'Jenine Gephart', 'Isidra Below', 'Ethelene Lesage', 'Soraya Hardcastle', 'Oralee Bussiere', 'Latisha Mcmurry', 'Rosemary Mauldin', 'Michiko Fu', 'Marshall Giblin', 'Trent Tong', 'Laureen Vives', 'Janett Cecere', 'Clarisa Allain', 'Stefania Frigo', 'Anastacia Cypert', 'Emmett Forward', 'Bettina Gong', 'Jenise Longstreet', 'Dick Carranza', 'Valentin Hearn', 'Genna Sera', 'Signe Coster', 'Pearlene Yant', 'Karine Twining', 'Olive Whaley', 'Mathilda Tomasi', 'Terrilyn Panos', 'Malia Brandy', 'Stanley Molnar', 'Melvin Sutterfield', 'Dianna Roney', 'Lola Skoglund', 'Mitchell Snelgrove', 'Julian Schrum', 'Evelynn Messing', 'Shaunta Chon', 'Mica Coate', 'Christene Ingerson', 'Karolyn Grasty', 'Josphine Horiuchi', 'Jesica Kerns', 'Osvaldo Roush', 'Omega Lena', 'Selena Garlington', 'Karena Kitchell', 'Ouida Stampley', 'Jarrett Decola', 'Susan Stage', 'Kena Kmetz', 'Denita Houghton', 'Marylou Ashman', 'Frank Bellard', 'Shanae Cassella', 'Ashley Burleson', 'Raisa Keck', 'Latia Houck', 'Vergie Hunte', 'Phylicia Meiers', 'Rigoberto Holton', 'Nelson Rohan', 'Loraine Christensen', 'Xiomara Whittingham', 'Tory Prisco', 'Kaleigh Amezcua', 'Carlo Marlowe', 'Adriene Adger', 'Delsie Contreras', 'Vertie Nardone', 'Lynelle Roder', 'Leatha Mccary', 'Bobette Baran', 'Yanira Mau', 'Carma Jung', 'Daysi Belin', 'William Pinard', 'Demetria Collins', 'Randi Levar', 'Hunter Surette', 'Hosea Degner', 'Zelma Appling', 'Nicola Byam', 'Kai Coomes', 'Signe Lavergne', 'Lan Wasko', 'Scott Simoneaux', 'Briana Maclin', 'Tresa Cullison', 'Catarina Presley', 'Justine Chou', 'Gabriel Hammonds', 'Al Hickel', 'Eura Coto', 'Piedad Cureton', 'Daron Muise', 'Cassey Randolph', 'Jodie Hansley', 'Theressa Marciniak', 'Dwana Talamantez', 'Nathanael Lemley', 'Clotilde Labelle', 'Wanda Patman', 'Cammy 
Lemieux', 'Jeraldine Small', 'Carlota Settle', 'Reva Kinyon', 'Miyoko Beckmann', 'Geraldo Lapoint', 'Caterina Spells', 'Zane Rosecrans', 'Micki Bosque', 'Quinn Loar', 'Ozella Kamerer', 'Yanira Evangelista', 'Gloria Bodner', 'Bree Gillard', 'Isidra Jakes', 'Sherwood Umland', 'Shery Brinn', 'Leona Stiner', 'Derek Szczepanski', 'Garnet Stoodley', 'Gracie Acres', 'Anneliese Yoshimoto', 'Paulene Dora', 'Werner Kerfoot', 'Venus Makuch', 'Columbus Nastasi', 'Shala Croteau', 'Johanne Stam', 'Ranee Wardwell', 'Mitsue Lentine', 'Elyse Escobar', 'Mayola Hiltz', 'Ione Helbing', 'Ming Atnip', 'Germaine Mclawhorn', 'Devin Elkin', 'Calvin Whisenant', 'Lai Glavin', 'Antonina Fernald', 'Fabiola Fahie', 'Paulene Hilyard', 'Nakia Jack', 'Janie Arwood', 'Thi Staples', 'Jaimee Garceau', 'Tana Mera', 'Alexa Eddy', 'Maribel Seguin', 'Beula Clement', 'Janette Mcniff', 'Jaleesa Nyman', 'Barbar Brueggemann', 'Ruthie Petrick', 'Cassy Niemeyer', 'Trina Gaudette', 'Sal Gabourel', 'Ji Reinhart', 'Abby Kump', 'Lucinda Aten', 'Minh Brumfield', 'Eladia Hiler', 'Exie Scholes', 'Cordelia Ott', 'Junita Osburn', 'Sparkle Leedy', 'Odell Castiglia', 'Randal Levalley', 'Tish Mccarthy', 'Stephen Lombardi', 'Elvin Reppert', 'Edgardo Prendergast', 'Arianne Wass', 'Viva Brake']
few_names = ['Arianne Wass', 'Jeraldine Small', 'Justine Chou', 'Kris Difilippo', 'Kristine Wool', 'Gabriel Hammonds', 'Georgette Chacon', 'Gilberte Laxton', 'Rigoberto Holton', 'Roberto Pavone', 'Vertie Nardone', 'Asia Richburg', 'Dori Schrimsher', 'Jodie Hansley', 'Josie Witcher', 'Josphine Horiuchi', 'Lovie Wiesner', 'Briana Eggleton', 'Briana Maclin', 'Brianna Cutshall', 'Brittani Buttery', 'Marshall Giblin', 'Rosemary Mauldin', 'Theressa Marciniak', 'Tresa Cullison', 'Yanira Evangelista']

In [None]:
from random import sample

print('Samples:', sample(names,3))

## Medium Articles

We are going to use a dataset called [Medium Articles](https://www.kaggle.com/hsankesara/medium-articles), which has over 300 articles related to Machine Learning, Artificial Intelligence, and Data Science.

In [None]:
# display files
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/medium-articles/articles.csv')
df.sample(3)

In [None]:
# preprocess the medium titles
# - lowercase, transliterate to ASCII, and collapse repeated spaces
import re
import unidecode

medium = df['title'].str.lower()
medium = medium.apply(unidecode.unidecode)
medium = medium.apply(lambda x: re.sub(' +', ' ', x))

-----
<a id="cluster"></a>
# Text Clustering

Let's define our clustering algorithm.
- We have to compute the Levenshtein similarity
- Further, we apply the Affinity Propagation Clustering algorithm to combine the texts

**Why** did I choose Levenshtein and Affinity Propagation?
- Levenshtein is a simple distance algorithm that can be used at word level or character level. Thus, it can handle different types of text; this is explained later with samples.
- [Affinity Propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) does not need the number of clusters in advance and also returns an exemplar per cluster. This exemplar is like a centroid, or the best example of the cluster. However, Affinity Propagation does not handle large numbers of texts well (_e.g.,_ over five thousand), and can produce inconsistent and odd clusters in that case.

In [None]:
%%capture
!pip install Distance

In [None]:
import distance
from sklearn.cluster import AffinityPropagation

<a id="lev"></a>
## Levenshtein Distance

First, we have to create our Levenshtein similarity. I also propose a Weighted Levenshtein similarity, based on the text length, just for experimentation.

In [None]:
def levenshtein(texts):
    '''
    Pairwise Levenshtein similarity matrix
    - Affinity Propagation expects similarities (higher = more similar),
      so each distance is negated: -1 * levenshtein(t1, t2)
    '''
    texts = np.asarray(texts, dtype=object)
    _similarity = np.array([[distance.levenshtein(list(w1),list(w2)) for w1 in texts] for w2 in texts])
    _similarity = -1*_similarity
    return _similarity
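
The weighted variant mentioned above is not defined in this notebook, so here is a minimal sketch of one possible reading: divide each distance by the length of the longer text, so that long texts are not penalised just for being long. The function name `weighted_similarity` and the normalisation by the longer length are assumptions made for illustration, not the original definition. It takes a plain (non-negated) distance matrix as input and uses only the standard library.

```python
def weighted_similarity(distances, texts):
    """Length-normalised negative similarities (hypothetical helper).

    distances[i][j] is the edit distance between texts[i] and texts[j];
    each entry is divided by the length of the longer of the two texts.
    """
    n = len(texts)
    weighted = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # guard against division by zero when both texts are empty
            longest = max(len(texts[i]), len(texts[j]), 1)
            weighted[i][j] = -distances[i][j] / longest
    return weighted


# Hand-computed distances for three short strings:
# 'Lion' -> 'Leon' is one substitution, 'Lion' -> 'Lioness' is three
# insertions, and 'Leon' -> 'Lioness' needs four edits.
texts = ['Lion', 'Leon', 'Lioness']
distances = [[0, 1, 3],
             [1, 0, 4],
             [3, 4, 0]]
similarity = weighted_similarity(distances, texts)
```

With this weighting every value falls between -1 and 0, which may make similarities easier to compare across short names and long titles.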

In [None]:
# Levenshtein at character level
texts = ['Lion', 'Leon']

levenshtein(texts)

In [None]:
# Levenshtein at word level
texts = [
    ['hi', 'my', 'name', 'is'], 
    ['hello', 'my', 'surname', 'is']]

levenshtein(texts)
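
To make the two examples above easier to check by hand, here is a small pure-Python sketch of the edit distance itself. The function `edit_distance` is a hypothetical stand-in written for illustration (the notebook relies on the `distance` package); it works on any pair of sequences, which is why the same code covers both character-level and word-level comparison.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# character level: 'i' -> 'e' is a single substitution
char_level = edit_distance('Lion', 'Leon')

# word level: 'hi' -> 'hello' and 'name' -> 'surname' are two substitutions
word_level = edit_distance(['hi', 'my', 'name', 'is'],
                           ['hello', 'my', 'surname', 'is'])

print(char_level, word_level)  # 1 2
```

At word level, a whole word counts as one unit, so replacing 'name' with 'surname' costs one edit no matter how many characters differ.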

<a id="clus"></a>
## Clustering

Finally, we can define our Text Clustering algorithm, using a similarity function and Affinity Propagation for the clustering.

I also defined a `word_level` parameter to choose between character-level and word-level comparison. Word level is recommended when you have many long texts that tend to share words: the texts are then compared word by word instead of character by character. For example, we are going to use character level for the fake names and word level for the Medium titles.

In [None]:
def text_clustering(texts, similarity=levenshtein, word_level=False):
    '''Cluster texts with Affinity Propagation over a precomputed similarity matrix.'''
    # word level: compare lists of tokens instead of sequences of characters
    if word_level:
        texts = [t.split() for t in texts]
    # use the similarity function passed by the caller
    _similarity = similarity(texts)
    _affprop = AffinityPropagation(affinity="precomputed", damping=0.5, verbose=True,
        random_state=0, max_iter=1_000, convergence_iter=10)
    _affprop.fit(_similarity)
    return _affprop, _similarity


def print_clusters(affprop, texts):
    '''Print clusters'''
    texts = np.asarray(texts)
    clusters = np.unique(affprop.labels_)
    print(f'\n~ Number of texts:: {texts.shape[0]}')
    print(f'~ Number of clusters:: {clusters.shape[0]}')
    if clusters.shape[0] < 2:
        print('Fewer than two clusters found - stopped')
        return
    for cluster_id in clusters:
        exemplar = texts[affprop.cluster_centers_indices_[cluster_id]]
        cluster = np.unique(texts[np.nonzero(affprop.labels_==cluster_id)])
        cluster_str = '";\n  "'.join(cluster)
        print(f'\n# Cluster ({cluster_id}) with ({len(cluster)}) elements')
        print(f'Exemplar:: {exemplar}')
        print(f'\nOthers::\n  "{cluster_str}"')
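
If you prefer to work with the clusters as data rather than printed text, the same grouping can be sketched as a dictionary keyed by exemplar. The helper `group_by_cluster` below is hypothetical and takes plain lists, so it runs without scikit-learn; in the notebook you would presumably pass `affprop.labels_`, `affprop.cluster_centers_indices_`, and a list of texts. The labels in the example are made up for illustration and are not results from this notebook.

```python
from collections import defaultdict


def group_by_cluster(labels, exemplar_indices, texts):
    """Map each exemplar text to the sorted texts assigned to its cluster."""
    members = defaultdict(list)
    for text, label in zip(texts, labels):
        members[label].append(text)
    # negative labels are skipped (assumed to mean "no cluster assigned")
    return {texts[exemplar_indices[label]]: sorted(group)
            for label, group in members.items() if label >= 0}


# made-up assignment of four names to two clusters
texts = ['Briana Eggleton', 'Briana Maclin', 'Josie Witcher', 'Jodie Hansley']
labels = [0, 0, 1, 1]
exemplar_indices = [1, 2]  # position of each cluster's exemplar in texts

groups = group_by_cluster(labels, exemplar_indices, texts)
for exemplar, group in groups.items():
    print(exemplar, '->', group)
```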

### Fake Names

Using only a set of 26 fake names.

In [None]:
# using only a set of 26 fake names
texts = few_names
affprop, _ = text_clustering(texts, similarity=levenshtein)
print_clusters(affprop, texts)

#### Discussion

Let's analyze the cluster results. **Notes:**

- The algorithm was able to find four (4) clusters in 26 fake names.
- We compare the texts at character level, thereby comparing each character.

The simplest way to analyze each cluster is to compare the exemplar with the other members. So, let's do that.

**Cluster 0**
- Exemplar: `Kristine Wool`
- Notes: 
   1. names ending with 'ne'
   1. names starting with 'Kris'
   1. surnames ending with 'ol', 'o' or 'l'
   1. surnames starting with 'W'

**Cluster 1**
- Exemplar: `Gilberte Laxton`
- Notes: 
   1. names ending with 'te'
   1. names starting with 'G'
   1. names with 'l', 'ert' in the middle
   1. surnames with 'on' at or near the end
   1. surnames with 'a' in the second character

**Cluster 2**
- Exemplar: `Josie Witcher`
- Notes: 
   1. names ending with 'ie', 'ia', or 'i'
   1. names starting with 'Jo' or 'Lo'
   1. surnames ending with 'er'
   1. surnames starting with 'W' or 'Wi'
   1. 'Asia Richburg' looks weird here

**Cluster 3**
- Exemplar: `Briana Maclin`
- Notes: 
   1. names with 'Briana'
   1. names ending with 'na' or 'a'
   1. names starting with 'Br'
   1. surnames ending with 'in' or 'on'
   1. surnames starting with 'Ma'
   1. 'Yanira Evangelista' looks weird here

Overall, the results look good. However, some names are quite different from their exemplar. This can happen when a name is similar to another name in the cluster rather than to the exemplar itself. To investigate further, you can (1) test different `damping` values in the Affinity Propagation algorithm, or (2) try modifying the Levenshtein metric.

### Medium Articles

#### Word-Level Analysis

In [None]:
texts = medium
affprop, _ = text_clustering(texts, similarity=levenshtein, word_level=True)
print_clusters(affprop, texts)

#### Discussion

Let's analyze the cluster results. **Notes:**

- The algorithm was able to find 75 clusters in 337 articles.
- We compare the texts at word level, thereby comparing each word.

We cannot analyze each of the clusters because we have a lot of them. Thus, I am going to highlight just some of them.

- **Cluster 4** - "machine learning is fun!" series
- **Cluster 6** - "artificial intelligence -- (...)" articles
- **Cluster 25** - a series of introductory content for beginners
- **Cluster 35** - "how to build (...)" or "how to create (...)" articles
- **Cluster 36** - articles about futurism in artificial intelligence
- **Cluster 43** - articles about "reinforcement learning"
- **Cluster 60** - "the future of (...)" articles and more futurism
- **Cluster 73** - "neural network" articles

However, we find a lot of clusters with only one element. To improve the results, it is necessary to add a data preprocessing step before clustering.
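
As one possible starting point for such a preprocessing step, here is a minimal sketch that strips punctuation and drops very common words before the word-level comparison. The function `clean_title` and the tiny `STOP_WORDS` set are hypothetical choices made for illustration; whether they actually reduce the number of single-element clusters has not been tested here.

```python
import re

# hypothetical, deliberately tiny stop-word list; extend it for real use
STOP_WORDS = {'a', 'an', 'the', 'of', 'to', 'in', 'for', 'and', 'is', 'on', 'with'}


def clean_title(title, stop_words=STOP_WORDS):
    """Lowercase a title, replace punctuation with spaces, drop stop words."""
    title = title.lower()
    # keep only ASCII letters, digits, and whitespace
    title = re.sub(r'[^a-z0-9\s]', ' ', title)
    tokens = [token for token in title.split() if token not in stop_words]
    return ' '.join(tokens)


cleaned = clean_title('Machine Learning is Fun! Part 3: Deep Learning')
print(cleaned)  # machine learning fun part 3 deep learning
```

In this notebook, the function could be applied with something like `medium.apply(clean_title)` before calling `text_clustering` with `word_level=True`.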



#### Character-Level Analysis

_The algorithm was not able to find clusters when the titles were analyzed at character level._

In [None]:
texts = medium
affprop, _ = text_clustering(texts, similarity=levenshtein, word_level=False)
print_clusters(affprop, texts)

-----
<a id="ref"></a>
# References

Additional readings about Text Clustering.

- [Medium - A Friendly Introduction to Text Clustering (2020)](https://towardsdatascience.com/a-friendly-introduction-to-text-clustering-fa996bcefd04)
- [Medium - Affinity Propagation Algorithm Explained (2019)](https://towardsdatascience.com/unsupervised-machine-learning-affinity-propagation-algorithm-explained-d1fef85f22c8)
- [Paper - Clustering by passing messages between data points (2007)](https://www.science.org/doi/abs/10.1126/science.1136800)
- [sklearn - Clustering text documents using k-means](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html)
- [Cross Validated (Stack Exchange) - Clustering a long list of strings (words) into similarity groups (2020)](https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups)