In [1]:
import pandas as pd
import numpy as np

In [2]:
corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data' ]

In [3]:
corpus

['data science is one of the most important fields of science',
 'this is one of the best data science courses',
 'data scientists analyze data']

In [4]:
words_set = set()

for doc in corpus:
    words = doc.split(' ')
    words_set.update(words)
    
print('Number of words in the corpus:',len(words_set))
print('The words in the corpus: \n', words_set)

Number of words in the corpus: 14
The words in the corpus: 
 {'fields', 'most', 'analyze', 'important', 'courses', 'science', 'data', 'one', 'this', 'is', 'the', 'scientists', 'of', 'best'}


**Computing the term frequency**
We can now create a DataFrame with one row per document in the corpus and one column per word in the word set, and use it to compute the term frequency (TF): the number of times a word occurs in a document divided by the total number of words in that document.

**The word set built above supplies the column labels:**

In [5]:
n_docs = len(corpus)         # Number of documents in the corpus
n_words_set = len(words_set) # Number of unique words in the corpus

# Recent pandas versions reject a set for `columns`, so pass a list
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ') # Words in the document
    for w in words:
        # .loc avoids chained assignment, which may not update df_tf
        df_tf.loc[i, w] += 1 / len(words)
        
df_tf

| | fields | most | analyze | important | courses | science | data | one | this | is | the | scientists | of | best |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.090909 | 0.090909 | 0.0 | 0.090909 | 0.0 | 0.181818 | 0.090909 | 0.090909 | 0.0 | 0.090909 | 0.090909 | 0.0 | 0.181818 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.111111 | 0.0 | 0.111111 | 0.111111 |
| 2 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 |


The DataFrame above has one column per word and one row per document. Each cell holds the frequency of that word in that document.
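A quick sanity check on this definition: because every count is divided by the document's length, the TF values in each row should sum to 1. The sketch below recomputes TF with `collections.Counter` from the standard library. The helper name `term_frequency` is illustrative and not part of the notebook.

```python
from collections import Counter

corpus = [
    'data science is one of the most important fields of science',
    'this is one of the best data science courses',
    'data scientists analyze data',
]

def term_frequency(doc):
    """Return {word: count / document length} for one whitespace-tokenized document."""
    words = doc.split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

tf_rows = [term_frequency(doc) for doc in corpus]

# Each row is a distribution over the words of that document, so it sums to 1.
row_sums = [sum(row.values()) for row in tf_rows]
print(row_sums)
print(tf_rows[2])
```

If a row does not sum to 1 (up to floating-point error), the tokenization and the length used in the denominator are probably out of sync.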

**Next, we compute the inverse document frequency (IDF):**

In [6]:
print("IDF of: ")

idf = {}

for w in words_set:
    k = 0    # number of documents in the corpus that contain this word
    
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
            
    idf[w] =  np.log10(n_docs / k)
    
    print(f'{w:>15}: {idf[w]:>10}' )

IDF of: 
        science: 0.17609125905568124
             is: 0.17609125905568124
         fields: 0.47712125471966244
      important: 0.47712125471966244
        courses: 0.47712125471966244
     scientists: 0.47712125471966244
           best: 0.47712125471966244
           data:        0.0
           this: 0.47712125471966244
            the: 0.17609125905568124
            one: 0.17609125905568124
             of: 0.17609125905568124
        analyze: 0.47712125471966244
           most: 0.47712125471966244
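With only three documents, a word can appear in one, two, or all three of them, so the printout above can contain only three distinct IDF values. A minimal sketch of that arithmetic, using the same base-10 logarithm as the cell above (the name `idf_by_doc_freq` is illustrative):

```python
import math

n_docs = 3  # size of the example corpus

# Map document frequency k (how many documents contain the word) to its IDF.
idf_by_doc_freq = {k: math.log10(n_docs / k) for k in (1, 2, 3)}

for k, value in idf_by_doc_freq.items():
    print(f'k = {k}: idf = {value:.6f}')
```

Words found in a single document get the largest weight, and a word found in every document gets exactly zero.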


# **Putting It All Together: TF-IDF**
Now that we have both TF and IDF, we can compute TF-IDF by multiplying them:

In [7]:
df_tf_idf = df_tf.copy()

for w in words_set:
    for i in range(n_docs):
        # .loc writes to df_tf_idf directly instead of through a chained lookup
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
        
df_tf_idf

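| | science | is | fields | important | courses | scientists | best | data | this | the | one | of | analyze | most |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.032017 | 0.016008 | 0.043375 | 0.043375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.016008 | 0.016008 | 0.032017 | 0.0 | 0.043375 |
| 1 | 0.019566 | 0.019566 | 0.0 | 0.0 | 0.053013 | 0.0 | 0.053013 | 0.0 | 0.053013 | 0.019566 | 0.019566 | 0.019566 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11928 | 0.0 |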


Note that "data" has an IDF of 0 because it appears in every document, so it is not considered an important term in this corpus. This changes slightly in the scikit-learn implementation below, where the score for "data" is no longer zero.
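The nested loops above can also be written as array operations. The following is a minimal sketch, assuming whitespace tokenization and an alphabetically sorted vocabulary (so the column order differs from the table above). The names `vocab`, `col`, `counts`, and `doc_freq` are illustrative.

```python
import numpy as np

corpus = [
    'data science is one of the most important fields of science',
    'this is one of the best data science courses',
    'data scientists analyze data',
]

# Sorted vocabulary gives a reproducible column order (a plain set does not).
vocab = sorted({w for doc in corpus for w in doc.split()})
col = {w: j for j, w in enumerate(vocab)}

# Raw counts: one row per document, one column per word.
counts = np.zeros((len(corpus), len(vocab)))
for i, doc in enumerate(corpus):
    for w in doc.split():
        counts[i, col[w]] += 1

tf = counts / counts.sum(axis=1, keepdims=True)  # divide each row by document length
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each word
idf = np.log10(len(corpus) / doc_freq)
tf_idf = tf * idf                                # idf broadcasts across the rows

print(float(tf_idf[0, col['science']]))
print(tf_idf[:, col['data']])
```

The score for "science" in the first document agrees with the 0.032017 shown in the table, and the whole "data" column is zero, as noted above.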

# TF-IDF Using scikit-learn

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

First we instantiate the class, then we call its **fit_transform** method on our test corpus. This tokenizes the documents and computes the TF-IDF scores in one step, covering all the calculations we performed by hand above.

In [9]:
tr_idf_model  = TfidfVectorizer()
tf_idf_vector = tr_idf_model.fit_transform(corpus)

Vectorizing the corpus with this method returns a sparse matrix.

Here are its type and shape:

In [10]:
print(type(tf_idf_vector), tf_idf_vector.shape)

<class 'scipy.sparse.csr.csr_matrix'> (3, 14)


We can convert it to a dense NumPy array to get a better look at the values:

In [11]:
tf_idf_array = tf_idf_vector.toarray()

print(tf_idf_array)

[[0.         0.         0.         0.18952581 0.32089509 0.32089509
  0.24404899 0.32089509 0.48809797 0.24404899 0.48809797 0.
  0.24404899 0.        ]
 [0.         0.40029393 0.40029393 0.23642005 0.         0.
  0.30443385 0.         0.30443385 0.30443385 0.30443385 0.
  0.30443385 0.40029393]
 [0.54270061 0.         0.         0.64105545 0.         0.
  0.         0.         0.         0.         0.         0.54270061
  0.         0.        ]]


The terms that correspond to the columns of the matrix are available through **get_feature_names_out** (the older **get_feature_names** was removed in recent scikit-learn releases):

In [12]:
words_set = tr_idf_model.get_feature_names_out().tolist()

print(words_set)

['analyze', 'best', 'courses', 'data', 'fields', 'important', 'is', 'most', 'of', 'one', 'science', 'scientists', 'the', 'this']


Finally, we build a DataFrame to display the TF-IDF scores of each document more clearly:

In [13]:
df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)

df_tf_idf

| | analyze | best | courses | data | fields | important | is | most | of | one | science | scientists | the | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.189526 | 0.320895 | 0.320895 | 0.244049 | 0.320895 | 0.488098 | 0.244049 | 0.488098 | 0.0 | 0.244049 | 0.0 |
| 1 | 0.0 | 0.400294 | 0.400294 | 0.23642 | 0.0 | 0.0 | 0.304434 | 0.0 | 0.304434 | 0.304434 | 0.304434 | 0.0 | 0.304434 | 0.400294 |
| 2 | 0.542701 | 0.0 | 0.0 | 0.641055 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.542701 | 0.0 | 0.0 |


As the output above shows, these TF-IDF scores differ from the ones we obtained by hand. The difference comes from scikit-learn's default settings: `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF based on the natural logarithm, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each document row to unit Euclidean (L2) length. The added 1 is why "data" no longer scores zero.
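To see where the scikit-learn numbers come from, the sketch below applies what the scikit-learn documentation describes as the defaults (`smooth_idf=True`, `norm='l2'`) by hand, using only the standard library. It assumes whitespace tokenization, which should match the vectorizer's tokenizer for this corpus because every token is lowercase and at least two characters long. The function name `sklearn_style_tfidf` is hypothetical.

```python
import math
from collections import Counter

corpus = [
    'data science is one of the most important fields of science',
    'this is one of the best data science courses',
    'data scientists analyze data',
]

n_docs = len(corpus)
tokenized = [doc.split() for doc in corpus]

# Document frequency: number of documents that contain each word.
doc_freq = Counter(w for words in tokenized for w in set(words))

def sklearn_style_tfidf(words):
    """Raw counts times smoothed IDF, then L2-normalized (assumed scikit-learn defaults)."""
    counts = Counter(words)
    weights = {
        w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w, c in counts.items()
    }
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

row = sklearn_style_tfidf(tokenized[2])
for w in sorted(row):
    print(f'{w:>12}: {row[w]:.6f}')
```

For the third document this gives about 0.641055 for "data" and 0.542701 for "analyze" and "scientists", matching the last row of the table above.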