This notebook contains some useful functions for computing the *tf-idf* for the data with arXiv abstracts.

Note here the use of specific pandas function like *concat*, *value_counts* and *groupby* which make possible to speed up the computations.

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\NOUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
data = pd.read_csv("./ASRS_data.csv", sep="|")

In [3]:
data1 = pd.read_csv("./arxiv_articles.csv", sep="|")

In [4]:
data.loc[2:2, "Narrative"]

2    THIS WAS MY FIRST DEP FROM BFI ON 31L. MY TURN...
Name: Narrative, dtype: object

In [5]:
data1.loc[2:2,"summary"]

2    A new simple proof of Stirling's formula via t...
Name: summary, dtype: object

In [6]:
def naif_regex_tokenize(text):
    """
    This is a very naif way of tokenize a text. Just using the
    regular expression "[a-z]" that will match any single word
    in lowercase.
    Returns a list with all the tokens.
    """
    p = re.compile("[a-z]+")
    return p.findall(text.lower())

def compute_tf(d):
    """
    Compute the tf for a given document d.
    The formula used is 
    
        tf(t, d) = 0.5 + 0.5 * (count(t, d)/max(count(t',d) for t' in d))
    
    This prevents bias in longer documents.
    """
    terms = pd.Series(naif_regex_tokenize(d))
    term_counts = terms.value_counts()
    max_tc = max(term_counts)
    return 0.5 + 0.5 * (term_counts / max_tc)

def compute_idf(D):
    """
    The input D is a list of pandas.Series
    having as each element, the term frequency 
    computed by the function compute_tf.
    """
    N = len(D)
    all_terms = pd.concat(D)
    nt = all_terms.index.value_counts() # The number of documents containing the term "t"
    return np.log(N / nt)

def compute_tf_idf_document(tf_document, idf):
    """Compute the tf-idf for each term in a document of the corpus

    Keyword arguments:
    tf_document -- list with the frequency of each term inside the document
    idf -- the idf value for each term in the corpus
    """
    return tf_document * np.array([idf[i] for i in tf_document.index])
    
def compute_tf_idf_corpus(D):
    """Compute the tf-idf for each term in a corpus

    Keyword arguments:
    D -- pandas Series containing a collection of documents in text format
    
    returns
        list of pandas Series containing the tf-idf(t, d, D) for each term
        inside each document of the corpus D
    """
    term_freq = [compute_tf(d) for d in D]
    idf = compute_idf(term_freq)
    return [compute_tf_idf_document(d, idf) for d in term_freq]



In [7]:
s = data['Narrative']
print(s)

0         APPROX 5 MI NW OF THE MISSION BAY VOR, WHILE A...
1         WE WERE IN OUR CLB FROM CMH TO DFW, WITH A CLR...
2         THIS WAS MY FIRST DEP FROM BFI ON 31L. MY TURN...
3         A VFR FLT, BEING CONDUCTED UNDER FAR PART 91, ...
4         SITUATION BEGAN ON A ROUTINE VFR PLEASURE FLT ...
                                ...                        
174564    Copy clearance for flight. ATC changed the fir...
174565    I was conducting a training flight with a stud...
174566    Received pre-departure clearance through comme...
174567    October 2013, I had B757-200 aircraft Work Pac...
174568    I have run into an issue on [some B757s] when ...
Name: Narrative, Length: 174569, dtype: object


In [8]:
D1 = data['Narrative'][:10]
print(D1)

0    APPROX 5 MI NW OF THE MISSION BAY VOR, WHILE A...
1    WE WERE IN OUR CLB FROM CMH TO DFW, WITH A CLR...
2    THIS WAS MY FIRST DEP FROM BFI ON 31L. MY TURN...
3    A VFR FLT, BEING CONDUCTED UNDER FAR PART 91, ...
4    SITUATION BEGAN ON A ROUTINE VFR PLEASURE FLT ...
5    CLRD DIRECT PVT VOR AFTER TKOF BOS. USING R NA...
6    60 GALLONS \"JET A\" FUEL ADDED TO LEFT WING D...
7    ACR Y CLIMBING TO FL210 WAS STOPPED AT 160 FOR...
8    SOME TIME HAD PASSED AFTER WE HAD PASSED THE R...
9    ACFT X ON FINAL FOR RWY 30L (FLT OF 2) MISSED ...
Name: Narrative, dtype: object


In [9]:
tf_idf = compute_tf_idf_corpus(data.loc[:300, "Narrative"]) # Sauvegarder avec pickle


In [10]:
tf_idf1 = compute_tf_idf_corpus(data1.loc[:300, "summary"])

At this stage we have the tf-idf values at each document. In order to select the *most important* terms (i.e. the terms with higher tf-idf values), we compute the **mean** of the tf-idf for each term.

In [11]:
print(s[1])

WE WERE IN OUR CLB FROM CMH TO DFW, WITH A CLRNC TO FL310. PASSING FL280, IND CENTER ASKED FOR A WIND RPT, WHICH WE GAVE. IND CENTER THEN ISSUED CLB CLRNC TO FL350. PASSING FL320, IND CENTER ASKED OUR ALT, THEN GAVE THE CLRNC, \"DES TO FL310.\" WE ACKNOWLEDGED AND I ASKED IF THERE WAS A TFC PROB. THE REPLY WAS, \"NOT YET, BUT I NEED YOU LEVEL IN 1 MIN.\" TFC PASSED ABOUT 1 MIN LATER OPP DIR AT FL330. IND CENTER THEN ASKED WHAT ALT WE HAD BEEN CLRD TO, TO WHICH WE RESPONDED FL350. I HAD READ BACK THE CLRNC AND THE F/O SET THE ALT ALERT. WE ARE SURE THE CLRNC WAS GIVEN IN ERROR TO FL350. ALSO, THE OTHER ACFT WAS COMING FROM ANOTHER CTLRS AIRSPACE. NO TFC CONFLICT OCCURRED.


In [12]:
all_terms = pd.concat(tf_idf)
print(all_terms['fms'])

fms    4.178303
fms    2.654451
dtype: float64


In [13]:
mean_tf_idf = all_terms.groupby(all_terms.index).mean()
print(mean_tf_idf)

a              0.089807
abatement      3.138911
abbreviated    2.902754
abc            3.779032
abcd           2.967697
                 ...   
zma            3.057380
zme            3.138911
zoa            3.170617
zone           2.240944
zones          2.575782
Length: 5433, dtype: float64


In [14]:
sorted_tf_idf = mean_tf_idf.sort_values(ascending=False)

In [15]:
print(len(sorted_tf_idf))

5433


In [16]:
sorted_tf_idf["the"]


0.05074843450252969

## Maintenant c'est à vous

Utiliser le code et les fonctions ci-dessus pour :

1. Proposer un dictionnaire des termes (une liste de mots qui peuvent être "informatives"). Par exemple, un mot qui est présent dans plus de 10 documents différents et qui a un fort valeur de *tf-idf* peut etre "informatif". 

2. Utiliser ces termes pour calculer, pour chaque document, un vecteur *tf-idf*. Ce vecteur aura les valeurs du tf-idf de chaqu'un des termes. Par example, si la liste de termes est ["float", "genetic", "circular"] et qu'on a 4 document. On doit produire une matrice de 4 lignes et 3 colonnes :
```
0.1 5.8 9
4.7 1.0 3
8.0 2.4 6.0
0.3 9.1 3.2
```   
Ici, chaque ligne contient les valeurs de *tf-idf* de ["float", "genetic", "circular"] (dans cet ordre).

3. Normaliser les lignes de cette matrice. La norme 2 des vecteurs *tf-idf* représentés à chaque ligne doit être 1. Les étapes **2** et **3** font partie de ce qu'on appelle *feature extraction*. 

4. Executer l'exemple décrit [ici](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). Appliquer la même analyse sur notre jeu des données.

5. Faire une implémentation de l'algorithme k-means. Voir les liens : https://en.wikipedia.org/wiki/K-means_clustering, https://en.wikipedia.org/wiki/K-means%2B%2B, https://fr.wikipedia.org/wiki/K-moyennes

6. Implémenter une fonction permettant de trier les documents en fonction du résultat de l'algorithme k-means avec **k** groupes.

7. Executer l'exemple décrit [ici](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py) puis appliquer la même analyse sur notre jeu des données



In [17]:
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

# fonction qui détermine si un mot est dans un texte (retourne un boolean)
def appartient(texte, mot):
    liste = naif_regex_tokenize(texte)
    appartient = False
    for i in liste:
        if mot == i:
            appartient = True
            break
    return appartient 

#fonction qui retourne une liste de mots qui apparaissent dans plus de 10 docs
def new_termes(liste_mots, corpus):
    new_list = []
    liste_mots_filtered = []
    
    for j in liste_mots:
        if j not in stopwords:
            liste_mots_filtered.append(j)
    
    for mot in liste_mots_filtered:
        n = 0
        for texte in corpus:
            if appartient(texte, mot):
                n += 1
        if n >= 2:
            new_list += [mot]         
    return new_list

#on prend les 300 mots avec le plus fort tf-idf et on regarde si ils apparaissent dans plus de 10 docs parmi les 1000 premiers documents du dataset
informatifs = new_termes(sorted_tf_idf[:500].index, data.loc[:1000, "Narrative"])

print(informatifs)


['aural', 'bfi', 'zla', 'edw', 'louis', 'csd', 'washington', 'pie', 'morristown', 'hour', 'ingested', 'oakland', 'driver', 'keying', 'accidentally', 'delta', 'anc', 'roc', 'smoothly', 'guire', 'nellis', 'retarded', 'mci', 'cart', 'tailskid', 'car', 'checking', 'sectionals', 'occurring', 'mains', 'doors', 'lenticular', 'fluctuations', 'retard', 'linkage', 'cue', 'access', 'ont', 'vital', 'reports', 'install', 'issuance', 'rows', 'xf', 'vehicle', 'instrs', 'inlet', 'abc', 'pao', 'boss', 'omega', 'throttle', 'teb', 'bravo', 'boston', 'hayden', 'tpa', 'fleet', 'romeo', 'kc', 'test', 'columbia', 'mlt', 'communications', 'unaccustomed', 'dividing', 'interrupted', 'carb', 'elements', 'forget', 'plugs', 'detroit', 'analysis', 'taxiway', 'pin', 'driven', 'lunken', 'tvc', 'forecast', 'abe', 'xay', 'national', 'oxygen', 'tractor', 'vny', 'private', 'norad', 'heli', 'zjx', 'monte', 'syr', 'bal', 'vice', 'obstacles', 'vest', 'trusting', 'excessive', 'obtaining', 'relations', 'aerial', 'carefully', 

In [18]:
def vecteur_tf_idf (informatifs, tf_idf, corpus):
    vecteur = np.zeros([len(corpus), len(informatifs)])
    for doc, tfidf in enumerate(tf_idf):
        for mot in tfidf.index:
            if mot in informatifs:
                vecteur[doc][informatifs.index(mot)]=tfidf[mot]
    return vecteur
      
vecteur_tf_idf = vecteur_tf_idf(informatifs, tf_idf, data.loc[:1000, "Narrative"])
print(vecteur_tf_idf[:100])
print(len(vecteur_tf_idf))

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         4.75592522 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         3.17061681]
 [0.         0.         0.         ... 0.         0.         0.        ]]
1001


In [19]:
def normalise (vecteur_tf_idf):
    norme = vecteur_tf_idf
    S = np.zeros(len(vecteur_tf_idf))
    for i in range(0, vecteur_tf_idf.shape[0]):
        for j in range(0, vecteur_tf_idf.shape[1]):
            S[i] += vecteur_tf_idf[i][j]
    for m in range(0, norme.shape[0]):
        if S[m] == 0:
            continue
        else:
            norme[m] = np.true_divide(norme[m], S[m])  #np.true_divide divise toute la ligne de la matrice par la somme de ses éléments
    return norme

vectors = normalise(vecteur_tf_idf)


for i in range(0, vectors.shape[0]):    #afficher toutes les lignes qui sont non nulles
    for j in range(0, vectors.shape[1]):
        if vectors[i][j] != 0:
            print(vectors[i])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 

 0.         0.        ]
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.23285325 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.25571558 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.       

 0.  0.  0.  0.  0.  0.  0.  0. ]
[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.2 0.  0.  0.  0.  0.2 0.2 0.  0.  0.2 0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.2 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0

 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

 0.  0.  0.  0.  0.  0.  0.  0. ]
[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.5 0.  0.5 0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0

In [20]:
#S = normalise(vecteur_tf_idf)

#for i in range(len(S)):
    #if S[i] != 0:
        #print(S[i])

In [28]:
# My k-means
import numpy as np
#import plotly.express as px
import pandas as pd

def distance(v1, v2):
    """
    Compute the distance between v1 and v2
    v1 and v2 are numpy arrays
    """
    return np.sqrt(np.sum((v1-v2)**2))

def assign(vectors, centers):
    """
    assign each vector to the closest center.
    vectors is a numpy matrix. We want to assign each
    row to the closest center.
    centers is a numpy matrix. Each row has a center
    
    returns a list of integers. 
    One value for each vector indicaing the closest center
    """
    groups = np.zeros(vectors.shape[0])
    for i in range(len(groups)):
        groups[i] = np.argmin(np.apply_along_axis(distance, 1, centers, vectors[i]))
    return groups

def compute_centers(vectors, groups):
    """
    Compute the centers for each group of 
    vectors
    vectors is a numpy matrix
    groups is a list containing the assignments
    of the vectors
    """
    new_centers = np.zeros([int(max(groups)) + 1, vectors.shape[1]])
    for i in range(int(max(groups)) + 1):
        ix = np.where(groups==i)[0]
        grp_members = vectors[ix, :]
        new_centers[i] = grp_members.mean(0)
    return new_centers

def choose_first_centers(vectors, k):
    """
    Select the first k centers for the begining of the
    k-means algorithm
    """
    ix = np.arange(0, vectors.shape[0])
    np.random.shuffle(ix)
    return vectors[ix[:k], :]

def kmeans(vectors, k, max_iterations = 500):
    """
    Naive implementation of k-means algorithm
    """
    centers_list = []
    centers = choose_first_centers(vectors, k)
    centers_list.append(centers)
    groups = assign(vectors, centers)
    new_centers = compute_centers(vectors, groups)
    centers_list.append(new_centers)
    nb_iter = 0
    while (np.sum(np.abs(centers - new_centers)) > 0) or (nb_iter > max_iterations):
        centers = np.copy(new_centers)
        groups = assign(vectors, centers)        
        new_centers = compute_centers(vectors, groups)
        centers_list.append(new_centers)
        nb_iter += 1
    print(groups)    
    return new_centers, centers_list

    
    

In [29]:
centers, centers_list = kmeans(vectors, 4)

[0. 0. 0. ... 0. 0. 0.]


In [23]:
centers, centers_list


vectors[:,:2]

array([[0., 0.],
       [0., 0.],
       [0., 1.],
       ...,
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [24]:
#c, c_list = kmeans(vectors[:,:2], 5)

In [25]:
df = pd.DataFrame({})

for i in range(len(centers_list)):
    v = {"x":vectors[:, 0], "y":vectors[:, 1], "p_type":["data_point"]*vectors.shape[0], 
    "iteration":[i]*vectors.shape[0]}
    df = pd.concat([df, pd.DataFrame(v)])
    c = {"x":centers_list[i][:, 0], "y":centers_list[i][:, 1], "p_type":["center"]*centers_list[0].shape[0],
    "iteration":[i]*centers_list[0].shape[0]}
    print(i)
    df = pd.concat([df, pd.DataFrame(c)])

0
1


ValueError: arrays must all be same length

In [None]:
import plotly.express as px

In [None]:
px.scatter(df, x="x", y="y", animation_frame="iteration", color="p_type")