## TF-IDF

Import the required libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

Test corpus

In [2]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

corpus

['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?']

Instantiate the TfidfVectorizer class.

In [3]:
vectorizer = TfidfVectorizer()

Call the fit method, which learns the vocabulary and the IDF weights from our corpus (no document is transformed at this point).

In [4]:
vectorizer.fit(corpus)

TfidfVectorizer()
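To see what fit actually stores, the sketch below inspects the fitted attributes `vocabulary_` (term to column index) and `idf_` (one weight per term). It rebuilds the corpus so that it runs on its own; the variable names `n_terms` and `first_column` are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = TfidfVectorizer().fit(corpus)

# Mapping from each term to its column index in the document-term matrix
n_terms = len(vectorizer.vocabulary_)
first_column = vectorizer.vocabulary_['and']

# One IDF weight per vocabulary term
print(n_terms, first_column, vectorizer.idf_.shape)
```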

Now that the instance has been fitted on our corpus, we can turn the corpus into a document-term matrix with the transform() method.

In [5]:
doc_term_matrix = vectorizer.transform(corpus)

In [6]:
# shape of the document-term matrix: (documents, terms)
doc_term_matrix.shape

(4, 9)
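When the same corpus is used for fitting and transforming, the two calls can be combined with `fit_transform`. The following is a minimal sketch that compares both routes; it rebuilds the corpus so it can run independently, and the names `two_step`, `one_step` and `same` are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Two-step route: fit, then transform
two_step = TfidfVectorizer().fit(corpus).transform(corpus)

# One-step route: fit and transform in a single call
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should produce the same sparse matrix
same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, same)
```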

Print the vocabulary of terms (9 terms).

In [7]:
vocabulary = vectorizer.get_feature_names_out()
print(vocabulary)

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


TF-IDF document-term matrix

In [8]:
doc_term_matrix.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

For example, the first document is represented by the vector:

In [9]:
print(doc_term_matrix.toarray()[0])

[0.         0.46979139 0.58028582 0.38408524 0.         0.
 0.38408524 0.         0.38408524]


Top words ranked by their IDF weight

In [10]:
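def get_top_n(vectorizer, n):
    # IDF weight of each term, aligned with the vocabulary order
    weights = vectorizer.idf_
    # Read the vocabulary from the vectorizer instead of relying on a global variable
    terms = vectorizer.get_feature_names_out()
    # Indices that sort the weights in ascending order
    sorted_index_array = np.argsort(weights)

    # Keep the n largest weights and their terms
    sorted_weights = weights[sorted_index_array[-n:]]
    sorted_vocabulary = terms[sorted_index_array[-n:]]

    # Print from highest to lowest IDF
    for word, idf in zip(sorted_vocabulary[::-1], sorted_weights[::-1]):
        print("word: "+word+"\t - idf score: "+str(idf))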

In [11]:
get_top_n(vectorizer, 10)

word: third	 - idf score: 1.916290731874155
word: second	 - idf score: 1.916290731874155
word: one	 - idf score: 1.916290731874155
word: and	 - idf score: 1.916290731874155
word: first	 - idf score: 1.5108256237659907
word: document	 - idf score: 1.2231435513142097
word: this	 - idf score: 1.0
word: the	 - idf score: 1.0
word: is	 - idf score: 1.0
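These scores follow from the document frequencies. With the default `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t. The sketch below recomputes the weights by hand; it uses CountVectorizer only to obtain the counts, and `doc_freq` and `manual_idf` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Raw term counts; same default tokenization and column order as TfidfVectorizer
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
# Number of documents in which each term appears at least once
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by default in scikit-learn
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

vectorizer = TfidfVectorizer().fit(corpus)
print(np.allclose(manual_idf, vectorizer.idf_))
```

A term such as "the", present in all 4 documents, gets ln(5/5) + 1 = 1, the minimum weight, while a term found in a single document gets ln(5/2) + 1, roughly 1.916.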


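We can reproduce the TF-IDF vector of the first document by hand: multiply its term counts by the IDF vector and L2-normalize the result.

In [13]: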
from sklearn.preprocessing import normalize

In [14]:
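# raw term counts of the first document, in vocabulary order
doc_1 = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1])
# IDF weights learned by the fitted vectorizer
idf_vector = vectorizer.idf_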

In [15]:
idf_vector

array([1.91629073, 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.91629073, 1.        , 1.91629073, 1.        ])

In [16]:
tf_idf_doc_1 = doc_1 * idf_vector
tf_idf_doc_1

array([0.        , 1.22314355, 1.51082562, 1.        , 0.        ,
       0.        , 1.        , 0.        , 1.        ])

In [17]:
# L2-normalize the unnormalized TF-IDF vector; the result matches row 0 of doc_term_matrix
normalize(tf_idf_doc_1.reshape(1, -1))

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
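The same manual computation can be applied to the whole corpus at once. The sketch below takes the raw counts from CountVectorizer, scales each column by its IDF weight, L2-normalizes each row, and compares the result with the TfidfVectorizer output. It assumes the default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`); the names `counts` and `manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Raw term counts (note that "document" appears twice in the second document)
counts = CountVectorizer().fit_transform(corpus).toarray()

vectorizer = TfidfVectorizer().fit(corpus)

# Term frequency times IDF, then row-wise L2 normalization (normalize defaults to 'l2')
manual = normalize(counts * vectorizer.idf_)

expected = vectorizer.transform(corpus).toarray()
print(np.allclose(manual, expected))
```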