This notebook contains helper functions for computing *tf-idf* scores on a dataset of arXiv abstracts.

Note the use of the pandas functions *concat*, *value_counts* and *groupby*, which make it possible to speed up the computations.

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hugojulia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
data = pd.read_csv("./arxiv_articles_simplified.csv", sep="|")

In [3]:
data.loc[2:2, "summary"]

2    A new simple proof of Stirling's formula via t...
Name: summary, dtype: object

In [4]:
def naif_regex_tokenize(text):
    """
    A very naive way of tokenizing a text: lowercase it, then use the
    regular expression "[a-z]+" to match every maximal run of the
    letters a-z. Digits, punctuation and accented characters act as
    separators.
    Returns a list with all the tokens.
    """
    p = re.compile("[a-z]+")
    return p.findall(text.lower())

def compute_tf(d):
    """
    Compute the tf for a given document d.
    The formula used is 
    
        tf(t, d) = 0.5 + 0.5 * (count(t, d)/max(count(t',d) for t' in d))
    
    This prevents bias in longer documents.
    """
    terms = pd.Series(naif_regex_tokenize(d))
    term_counts = terms.value_counts()
    max_tc = term_counts.max()
    return 0.5 + 0.5 * (term_counts / max_tc)

def compute_idf(D):
    """
    The input D is a list of pandas.Series, one per document,
    each holding the term frequencies computed by the function
    compute_tf (indexed by term).
    Returns a pandas.Series with log(N / n_t) for each term t.
    """
    N = len(D)
    all_terms = pd.concat(D)
    nt = all_terms.index.value_counts() # The number of documents containing the term "t"
    return np.log(N / nt)

def compute_tf_idf_document(tf_document, idf):
    """Compute the tf-idf for each term in a document of the corpus

    Keyword arguments:
    tf_document -- pandas Series with the term frequency of each term inside the document
    idf -- pandas Series with the idf value for each term in the corpus
    """
    # Select the idf of the document's terms, then multiply term by term
    return tf_document * idf.loc[tf_document.index]
    
def compute_tf_idf_corpus(D):
    """Compute the tf-idf for each term in a corpus

    Keyword arguments:
    D -- pandas Series containing a collection of documents in text format
    
    returns
        list of pandas Series containing the tf-idf(t, d, D) for each term
        inside each document of the corpus D
    """
    term_freq = [compute_tf(d) for d in D]
    idf = compute_idf(term_freq)
    return [compute_tf_idf_document(d, idf) for d in term_freq]
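
As a small worked example of the formulas used above, the following sketch recomputes tf, idf and tf-idf on a three-sentence toy corpus using only the standard library. The corpus and the names `toy_tf`, `toy_corpus`, `toy_idf` and `toy_tf_idf` are illustrative assumptions, not part of the notebook. It shows that a term present in every document (here "the") gets idf = log(1) = 0, and therefore a tf-idf of 0, whatever its term frequency.

```python
import math
import re
from collections import Counter

def toy_tf(text):
    # Augmented term frequency: 0.5 + 0.5 * count / max count in the document
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    max_count = max(counts.values())
    return {term: 0.5 + 0.5 * c / max_count for term, c in counts.items()}

# Hypothetical toy corpus, for illustration only
toy_corpus = [
    "the cat sat on the mat",
    "the dog barked",
    "the cat and the dog",
]

toy_tfs = [toy_tf(doc) for doc in toy_corpus]
n_docs = len(toy_corpus)

# Number of documents containing each term
doc_freq = Counter(term for tf in toy_tfs for term in tf)
toy_idf = {term: math.log(n_docs / n) for term, n in doc_freq.items()}

# tf-idf of each term in each document
toy_tf_idf = [{term: w * toy_idf[term] for term, w in tf.items()} for tf in toy_tfs]

print(toy_tf_idf[0])
```

In the first sentence, "cat" (present in 2 of the 3 documents) scores 0.75 * log(3/2), while "sat" (present in only one) scores 0.75 * log(3).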



In [5]:
s = data['summary']
print(s)

0        We show that every essentially countable orbit...
1        Hans Grauert died in September of 2011. This a...
2        A new simple proof of Stirling's formula via t...
3        To each natural deformation quantization on a ...
4        We show that finite Galois extensions with cyc...
                               ...                        
39606    Recently, a novel method for developing filter...
39607    We consider the problem of numerically evaluat...
39608    Ecological processes may exhibit memory to pas...
39609    Complex networks are used to describe a broad ...
39610    The Hamiltonian Monte Carlo method generates s...
Name: summary, Length: 39611, dtype: object


In [6]:
D1 = data['summary'][:10]
print(D1)

0    We show that every essentially countable orbit...
1    Hans Grauert died in September of 2011. This a...
2    A new simple proof of Stirling's formula via t...
3    To each natural deformation quantization on a ...
4    We show that finite Galois extensions with cyc...
5    We develop Nevanlinna's theory for a class of ...
6    Let H, K be subgroups of G. We investigate the...
7    We discuss the mathematician George Bruce Hals...
8    A proof for the maximum modulus principle (in ...
9    The aim of this paper is to give some applicat...
Name: summary, dtype: object


In [7]:
tf_idf = compute_tf_idf_corpus(data.loc[:, "summary"]) # To do: save with pickle


At this stage we have the tf-idf values of every term in each document. In order to select the *most important* terms (i.e. the terms with the highest tf-idf values), we compute the **mean** of the tf-idf for each term over the documents that contain it.
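
To illustrate the mechanism, here is a minimal sketch on made-up values (the two Series and their scores are assumptions, not taken from the dataset). Concatenating per-document Series yields an index in which a term appears once per document containing it; grouping by the index then collapses these repeated labels into one mean per term.

```python
import pandas as pd

# Hypothetical tf-idf values for two documents
doc_a = pd.Series({"graph": 2.0, "proof": 1.0})
doc_b = pd.Series({"graph": 4.0, "kernel": 3.0})

# "graph" appears twice in the concatenated index, once per document
stacked = pd.concat([doc_a, doc_b])

# One mean value per distinct term (the result is sorted by term)
mean_per_term = stacked.groupby(stacked.index).mean()
print(mean_per_term)
```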

In [8]:
print(s[1])

Hans Grauert died in September of 2011. This article reviews his life in mathematics and recalls some detail his major accomplishments.


In [9]:
all_terms = pd.concat(tf_idf)
print(all_terms['fms'])

fms    5.771334
fms    5.819832
dtype: float64


In [10]:
mean_tf_idf = all_terms.groupby(all_terms.index).mean()
print(mean_tf_idf)

a           0.070318
aa          3.745126
aaa         4.856940
aaaaa       5.671533
aaaattga    5.850634
              ...   
zyjhk       5.513991
zymes       5.604809
zynga       8.079447
zynq        5.653551
zz          6.805840
Length: 63578, dtype: float64


In [11]:
sorted_tf_idf = mean_tf_idf.sort_values(ascending=False)

In [12]:
print(len(sorted_tf_idf))

63578


In [13]:
sorted_tf_idf["the"]


0.017889347956179486

## Now it's your turn

Use the code and functions above to:

1. Propose a dictionary of terms (a list of words that may be "informative"). For example, a word that appears in more than 10 different documents and has a high *tf-idf* value can be considered "informative".

2. Use these terms to compute a *tf-idf* vector for each document. This vector holds the tf-idf value of each of the terms. For example, if the list of terms is ["float", "genetic", "circular"] and there are 4 documents, the result should be a matrix with 4 rows and 3 columns:
```
0.1 5.8 9.0
4.7 1.0 3.0
8.0 2.4 6.0
0.3 9.1 3.2
```
Here, each row contains the *tf-idf* values of ["float", "genetic", "circular"] (in that order).

3. Normalize the rows of this matrix: the 2-norm of the *tf-idf* vector in each row must be 1. Steps **2** and **3** are part of what is called *feature extraction*.

4. Run the example described [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). Apply the same analysis to our dataset.

5. Implement the k-means algorithm. See the links: https://en.wikipedia.org/wiki/K-means_clustering, https://en.wikipedia.org/wiki/K-means%2B%2B, https://fr.wikipedia.org/wiki/K-moyennes

6. Implement a function that sorts the documents according to the result of the k-means algorithm with **k** groups.

7. Run the example described [here](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py), then apply the same analysis to our dataset.
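
For steps 5 and 6, the following is a minimal sketch of Lloyd's algorithm in NumPy, assuming the document vectors are the rows of a 2-D array. The names `kmeans_sketch` and `group_documents` are hypothetical, and the initialisation simply picks `k` distinct rows at random (not k-means++), so results can depend on the seed.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) with random initialisation."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centroids):
        # Squared Euclidean distance of every point to every centroid: shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment, consistent with the returned centroids
    return assign(centroids), centroids

def group_documents(labels, k):
    """Return, for each of the k groups, the list of document positions in it."""
    return [np.flatnonzero(labels == j).tolist() for j in range(k)]

# Toy data: two well-separated groups of points
points = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(group_documents(labels, 2))
```

On the real data, `X` would be the normalized tf-idf matrix and the positions returned by `group_documents` would index the rows of the corpus.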



In [34]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name, so the nltk module is not shadowed

# Return True if the word `mot` is one of the tokens of the text `texte`
def appartient(texte, mot):
    return mot in naif_regex_tokenize(texte)

# Return the words of liste_mots that are not stop words and that
# appear in at least 3 documents of the corpus
def new_termes(liste_mots, corpus):
    new_list = []
    liste_mots_filtered = [j for j in liste_mots if j not in stop_words]

    for mot in liste_mots_filtered:
        n = 0
        for texte in corpus:
            if appartient(texte, mot):
                n += 1
        if n >= 3:
            new_list += [mot]
    return new_list

# Take the 2000 terms with the highest mean tf-idf and keep those that appear in at least 3
# of the first 1501 documents of the dataset (.loc slicing includes the end label 1500)
informatifs = new_termes(sorted_tf_idf[:2000].index, data.loc[:1500, "summary"])

print(informatifs)


['birational']
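
The document counts obtained by the nested loops above can also be computed with the same `concat` / `value_counts` pattern that `compute_idf` uses. The sketch below shows one possible vectorised variant on made-up data; the scores, the threshold `min_docs` and the variable names are assumptions for illustration, and stop-word filtering is left out.

```python
import pandas as pd

# Hypothetical per-document tf-idf Series (one per document)
toy_tf_idf = [
    pd.Series({"graph": 1.2, "proof": 0.4, "lemma": 5.0}),
    pd.Series({"graph": 0.9, "kernel": 2.1}),
    pd.Series({"graph": 1.1, "kernel": 1.8, "proof": 0.5}),
]

all_terms = pd.concat(toy_tf_idf)

# Each term appears once per document in the index, so counting
# index labels gives the number of documents containing the term
doc_freq = all_terms.index.value_counts()
mean_score = all_terms.groupby(all_terms.index).mean()

min_docs = 2  # assumed threshold; the exercise suggests 10 on the real data
frequent_terms = doc_freq[doc_freq >= min_docs].index

# Keep the frequent terms, ordered by decreasing mean tf-idf
candidates = mean_score[mean_score.index.isin(frequent_terms)]
informative = candidates.sort_values(ascending=False).index.tolist()
print(informative)
```

Here "lemma" has the highest score but occurs in a single document, so it is discarded by the threshold.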


In [31]:
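# The function has its own name so that the result stored in
# vecteur_tf_idf below does not overwrite it
def build_tf_idf_matrix(informatifs, tf_idf, corpus):
    vecteur = np.zeros([len(corpus), len(informatifs)])
    # tf_idf covers the whole dataset; zip stops after the len(corpus)
    # documents that have a row in the matrix
    for doc, tfidf in zip(range(len(corpus)), tf_idf):
        for mot in tfidf.index:
            if mot in informatifs:
                vecteur[doc][informatifs.index(mot)] = tfidf[mot]
    return vecteur

vecteur_tf_idf = build_tf_idf_matrix(informatifs, tf_idf, data.loc[:1500, "summary"])
print(vecteur_tf_idf[:5])
print(len(vecteur_tf_idf))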

[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]
1501
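
As an alternative to filling the matrix entry by entry, each per-document Series can be reindexed on the list of terms. This is only a sketch under the assumption that every Series has a unique index (which holds for the output of `value_counts`); the function name `toy_tf_idf_matrix` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def toy_tf_idf_matrix(doc_scores, terms):
    # reindex keeps the order of `terms` and puts 0.0 where a term is absent
    rows = [s.reindex(terms, fill_value=0.0).to_numpy() for s in doc_scores]
    return np.vstack(rows)

terms = ["float", "genetic", "circular"]
docs = [
    pd.Series({"float": 0.1, "genetic": 5.8, "proof": 2.0}),
    pd.Series({"circular": 3.0}),
]
matrix = toy_tf_idf_matrix(docs, terms)
print(matrix)
```

Terms that are not in the list (here "proof") are simply dropped, and the column order always follows `terms`.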


In [35]:
def normalise(vecteur_tf_idf):
    # Divide each row by the sum of its elements, so that every non-zero row sums to 1.
    # Note that this is a sum (L1) normalisation, not the Euclidean (L2) norm asked for in step 3.
    norme = vecteur_tf_idf.copy()  # work on a copy so the input matrix is left unchanged
    S = norme.sum(axis=1)          # sum of each row
    for m in range(norme.shape[0]):
        if S[m] != 0:              # all-zero rows are left as they are
            norme[m] = norme[m] / S[m]
    return norme

L = normalise(vecteur_tf_idf)

# Print the non-zero rows; a row is printed once per non-zero entry it contains
for i in range(0, L.shape[0]):
    for j in range(0, L.shape[1]):
        if L[i][j] != 0:
            print(L[i])

[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0.         0.         0.         0.53175633 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.46824367]
[0.         0.         0.         0.53175633 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.46824367]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0.

In [None]:
S = normalise(vecteur_tf_idf)

for i in range(len(S)):
    if S[i].any():  # S[i] is a whole row: test whether it has any non-zero entry
        print(S[i])
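
Step 3 of the exercise asks for rows whose Euclidean norm is 1, whereas `normalise` divides each row by the sum of its entries. A minimal sketch of a row-wise L2 normalisation with `np.linalg.norm` could look as follows; the function name `l2_normalise_rows` and the sample matrix are assumptions for illustration.

```python
import numpy as np

def l2_normalise_rows(matrix):
    matrix = np.asarray(matrix, dtype=float)
    # Euclidean norm of each row, kept as a column for broadcasting
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Avoid dividing by zero: all-zero rows stay all-zero
    safe_norms = np.where(norms == 0, 1.0, norms)
    return matrix / safe_norms

sample = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 2.0]])
normalised = l2_normalise_rows(sample)
print(normalised)
print(np.linalg.norm(normalised, axis=1))
```

Every non-zero row of the result has norm 1, and the input matrix is not modified.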