## Introduction to computing the TF-IDF matrix

TF-IDF (*term frequency-inverse document frequency*) turns a collection of documents into a matrix of numerical values.

For an introduction to the concept, see [the Wikipedia article on TF-IDF](https://fr.wikipedia.org/wiki/TF-IDF).

There are several ways to compute TF-IDF. Here we use the following formulas:

Let $D$ be the set of all documents and $d$ a document in $D$. Let $T$ be the set of all terms in a dictionary and $t$ a term in $T$.

$$
TF(t, d) = \frac{f_{t, d}}{\max_{t'\in d} f_{t', d}}
$$

$$
IDF(t, D) = \log\frac{N}{1 + |\{d\in D : t\in d\}|}
$$

$$
TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
$$

Where:

* $f_{t, d}$: number of occurrences of term $t$ in document $d$
* $N$: number of documents in the corpus, $N=|D|$
* $|\{d\in D : t\in d\}|$: number of documents containing term $t$
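
The three formulas above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the helper names (`tokenize`, `tf`, `idf`, `tfidf`), the lowercase regex tokenization, the toy corpus, and the use of the natural logarithm are all assumptions that the text does not fix.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a document into lowercase word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf(term, doc):
    """f_{t,d} divided by the count of the most frequent term in d."""
    counts = Counter(tokenize(doc))
    if not counts:
        return 0.0  # empty document: no term has a frequency
    return counts[term] / max(counts.values())


def idf(term, docs):
    """log(N / (1 + number of documents containing the term))."""
    n_containing = sum(1 for doc in docs if term in tokenize(doc))
    return math.log(len(docs) / (1 + n_containing))


def tfidf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)


corpus = ["a b a", "b c", "c c d"]  # hypothetical toy corpus
print(tf("b", corpus[0]))           # 1 occurrence / 2 occurrences of "a" = 0.5
print(idf("a", corpus))             # log(3 / (1 + 1)), about 0.405
```

Because of the `1 +` in the denominator, a term that occurs in every document gets a slightly negative IDF under this formula.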


### An example

We start from the following collection of documents $D$:

1. "The sky is blue."
2. "The sun is bright today."
3. "The sun in the sky is bright."
4. "We can see the shining sun, the bright sun."

Suppose the dictionary $T$ consists of the words ["sky", "sun", "the"].
Our TF-IDF matrix will then have **4 rows and 3 columns**. It is computed as follows:

$$N = |D| = 4$$

$$IDF("sky", D) = \log\frac{N}{1 + |\{d\in D : t\in d\}|} = \log\frac{N}{1 + |\{d\in D : "sky"\in d\}|} = \log\frac{4}{1 + 2}$$

Note that the word "sky" appears in documents 1 and 3, so the number of documents containing "sky" ($|\{d\in D : "sky"\in d\}|$) is 2. The counts $f$ of "sky" in each document are:

* $f("sky", d1) = 1$
* $f("sky", d2) = 0$
* $f("sky", d3) = 1$ 
* $f("sky", d4) = 0$

We now compute the TF using the formula above:

* $TF("sky", d1) = 1$ (every word in document 1 appears exactly once)
* $TF("sky", d2) = 0$
* $TF("sky", d3) = 1/2$ (the word "the" appears twice and is the most frequent term)
* $TF("sky", d4) = 0$

This gives the first column of the TF-IDF matrix:


 D     | sky                          | sun           | the  |
---    |------------------------------| --------------|----- |
**1**  | $1 * \log\frac{4}{1 + 2}$    | ?             | ?    |
**2**  | $0$                          | ?             | ?    |
**3**  | $0.5 * \log\frac{4}{1 + 2}$  | ?             | ?    |
**4**  | $0$                          | ?             | ?    |
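
As a numerical check of the "sky" column, the formulas can be applied directly to the four documents. The sketch below assumes the natural logarithm and a simple preprocessing step (lowercasing and stripping `.` and `,`), neither of which the text states explicitly; `tokens`, `idf_sky` and `column` are illustrative names.

```python
import math
from collections import Counter

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]


def tokens(doc):
    # Assumed preprocessing: lowercase each word and strip trailing punctuation.
    return [w.strip(".,").lower() for w in doc.split()]


# IDF("sky", D) = log(4 / (1 + 2)), since "sky" occurs in documents 1 and 3.
idf_sky = math.log(len(docs) / (1 + sum("sky" in tokens(d) for d in docs)))

column = []
for d in docs:
    counts = Counter(tokens(d))
    # TF = count of "sky" divided by the count of the most frequent word.
    column.append(counts["sky"] / max(counts.values()) * idf_sky)

print([round(v, 4) for v in column])  # [0.2877, 0.0, 0.1438, 0.0]
```

Document 1 keeps the full IDF value because all of its words occur once, while document 3 is halved because "the" occurs twice there.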


$$IDF("sun", D) = \log\frac{N}{1 + |\{d\in D : t\in d\}|} = \log\frac{N}{1 + |\{d\in D : "sun"\in d\}|} = \log\frac{4}{1 + 3}$$


* $f("sun", d1) = 0$
* $f("sun", d2) = 1$
* $f("sun", d3) = 1$ 
* $f("sun", d4) = 2$

We now compute the TF using the formula above:

* $TF("sun", d1) = 0$
* $TF("sun", d2) = 1$
* $TF("sun", d3) = 1/2$ (the word "the" appears twice and is the most frequent term)
* $TF("sun", d4) = 2 / 2 = 1$

For the TF-IDF, since $IDF("sun", D) = \log\frac{4}{4} = 0$:

* $TFIDF("sun", d1, D) = TF("sun", d1) * IDF("sun", D) = 0 * 0 = 0$
* $TFIDF("sun", d2, D) = TF("sun", d2) * IDF("sun", D) = 1 * 0 = 0$
* $TFIDF("sun", d3, D) = TF("sun", d3) * IDF("sun", D) = 1/2 * 0 = 0$
* $TFIDF("sun", d4, D) = TF("sun", d4) * IDF("sun", D) = 1 * 0 = 0$


We can now add the values for "sun" to the TF-IDF matrix:

 D     | sky                          | sun           | the  |
---    |------------------------------| --------------|----- |
**1**  | $1 * \log\frac{4}{1 + 2}$    | 0             | ?    |
**2**  | $0$                          | 0             | ?    |
**3**  | $0.5 * \log\frac{4}{1 + 2}$  | 0             | ?    |
**4**  | $0$                          | 0             | ?    |
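
The same procedure extends to the remaining column. As a rough sketch, the whole 4 x 3 matrix for the dictionary ["sky", "sun", "the"] can be computed in one pass; the regex tokenization, the natural logarithm, and the variable names (`tokenized`, `idf`, `matrix`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

docs = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]
terms = ["sky", "sun", "the"]

# Assumed tokenization: lowercase, keep alphabetic runs only.
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
n_docs = len(tokenized)

# IDF per term: log(N / (1 + number of documents containing the term)).
idf = {
    t: math.log(n_docs / (1 + sum(t in toks for toks in tokenized)))
    for t in terms
}

# One row per document, one column per term.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    top = max(counts.values())  # count of the most frequent word
    matrix.append([counts[t] / top * idf[t] for t in terms])

print("doc  " + "  ".join(f"{t:>8}" for t in terms))
for i, row in enumerate(matrix, start=1):
    print(f"{i:<4} " + "  ".join(f"{v:8.4f}" for v in row))
```

With this formula, "the" occurs in all four documents, so its IDF is $\log\frac{4}{1 + 4} < 0$: the `1 +` in the denominator makes very common terms slightly negative rather than zero.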





In [1]:
import numpy as np
import pandas as pd
import re
import math

In [2]:
data = pd.read_csv("./arxiv_articles_simplified.csv", sep="|")

In [3]:
data.loc[0:5, :]

|   | id | title | authors | arxiv_primary_category | summary | published | updated | general_category |
|---|----|-------|---------|------------------------|---------|-----------|---------|------------------|
| 0 | http://arxiv.org/abs/2001.05867v1 | $σ$-Lacunary actions of Polish groups | Jan Grebik | math.LO | We show that every essentially countable orbit... | 2020-01-16T15:09:02Z | 2020-01-16T15:09:02Z | math |
| 1 | http://arxiv.org/abs/1303.6933v1 | Hans Grauert (1930-2011) | Alan Huckleberry | math.HO | Hans Grauert died in September of 2011. This a... | 2013-03-27T19:23:57Z | 2013-03-27T19:23:57Z | math |
| 2 | http://arxiv.org/abs/1407.3775v1 | A New Proof of Stirling's Formula | Thorsten Neuschel | math.HO | A new simple proof of Stirling's formula via t... | 2014-07-10T11:26:39Z | 2014-07-10T11:26:39Z | math |
| 3 | http://arxiv.org/abs/math/0307381v3 | On Dequantization of Fedosov's Deformation Qua... | Alexander V. Karabegov | math.QA | To each natural deformation quantization on a ... | 2003-07-30T06:20:33Z | 2003-09-20T01:29:18Z | math |
| 4 | http://arxiv.org/abs/1604.06794v1 | Cyclic extensions are radical | Mariano Suárez-Álvarez | math.HO | We show that finite Galois extensions with cyc... | 2016-04-21T22:24:54Z | 2016-04-21T22:24:54Z | math |
| 5 | http://arxiv.org/abs/1712.09576v2 | The Second Main Theorem in the hyperbolic case | Min Ru;Nessim Sibony | math.CV | We develop Nevanlinna's theory for a class of ... | 2017-12-27T13:17:08Z | 2019-01-03T07:51:11Z | math |


## Now it's your turn

1 - Implement functions that compute the TF-IDF matrix for the example dataset and the following dictionary of words:

T = ['blue','bright', 'can', 'in', 'is', 'see', 'shining', 'sky', 'sun', 'the', 'today', 'we']

2 - Use the functions you implemented to compute the TF-IDF matrix for the same dictionary $T$ and the first 100 documents of the collection *./arxiv_articles_simplified.csv*


In [6]:
D = data["summary"][:10]   # first 10 summaries in this run (the exercise asks for [:100])
T = ['blue','bright', 'can', 'in', 'is', 'see', 'shining', 'sky', 'sun', 'the', 'today', 'we']

def IDF(word, D):
    """Inverse document frequency: log(N / (1 + number of documents containing `word`))."""
    # Note: `word in doc` is a substring test on the raw text,
    # so "in" also matches inside longer words such as "showing".
    doc_count = sum(1 for doc in D if word in doc)
    return math.log(D.size / (1 + doc_count))

print(IDF("in", D))


def nombre(mot, phrase):
    """Number of occurrences of `mot` among the space-separated words of `phrase`."""
    return phrase.split(" ").count(mot)

def max_occurrences(phrase):
    """Occurrence count of the most frequent word in `phrase`.

    Renamed from `max` so that the Python built-in is no longer shadowed.
    """
    mots = phrase.split(" ")
    return max(mots.count(m) for m in mots)


def TF(mot, phrase):
    """Term frequency of `mot`, normalised by the most frequent word in `phrase`."""
    return nombre(mot, phrase) / max_occurrences(phrase)

print(TF("in", D[1]))


def TFIDF(mot, D, phrase):
    """TF-IDF of `mot` for the document `phrase` within the corpus `D`."""
    return TF(mot, phrase) * IDF(mot, D)

# Matching is case-sensitive: "In" and "in" are treated as different terms.
print(TFIDF("In", D, D[1]))

def matrice(T, D):
    """TF-IDF matrix with one row per document of `D` and one column per term of `T`."""
    tab = np.zeros([D.size, len(T)])
    for i, doc in enumerate(D):
        for j, mot in enumerate(T):
            tab[i, j] = TFIDF(mot, D, doc)
    return tab

pd.DataFrame(matrice(T, D))



0.0
1.0
0.0


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02634,0.0,0.229073
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105361,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105361,0.0,0.229073
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105361,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03512,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105361,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07024,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
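
The document count inside `IDF` relies on the `in` operator applied to a raw string, which tests for substrings rather than whole words. The sketch below contrasts that with a token-based count; the two helper names and the three sample sentences are hypothetical, and the lowercase regex tokenization is only one possible choice.

```python
import re


def doc_freq_substring(word, docs):
    """Count documents whose raw text contains `word` anywhere, even inside other words."""
    return sum(1 for doc in docs if word in doc)


def doc_freq_tokens(word, docs):
    """Count documents in which `word` appears as a whole lowercase token."""
    return sum(1 for doc in docs if word in re.findall(r"[a-z]+", doc.lower()))


# Hypothetical sample sentences, not taken from the arXiv dataset.
docs = [
    "Nothing interesting here.",  # "in" only occurs inside longer words
    "Written in plain words.",    # "in" occurs as a word
    "Short and sweet.",           # no "in" at all
]

print(doc_freq_substring("in", docs))  # 2
print(doc_freq_tokens("in", docs))     # 1
```

Counting whole tokens matches the condition $t \in d$ in the IDF formula more closely, at the cost of having to decide how documents are split into terms.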
