# Class 14
## TF-IDF Vectorization
### TF Vectors
Recall that the $TF$ value for a token $t$ in a document $D$ from a corpus $C$, with lexicon $L$, is:

$$TF(t,D)=  \dfrac{\text{Number of times token $t$ appears in document $D$}} {\text{Number of tokens in document $D$}}$$
### TF-IDF Vectors
We compute the $TF$ vector not over the whole corpus but over a single document $D$ and the lexicon of that document, $L_D$.

$$TF(t,D)=  \dfrac{\text{Number of times token $t$ appears in document $D$}} {\text{Number of tokens in document $D$}}$$

The $IDF$ part is

$$IDF(t,C)=  \dfrac{\text{Number of documents in corpus $C$}} {\text{Number of documents in corpus $C$ that contain $t$}}$$

Therefore

$$\text{TF-IDF}(t,D,C) = TF(t,D) \times \ln\left(IDF(t,C)\right)$$
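
As a minimal sketch of the three formulas above, the following Python functions compute TF, IDF and TF-IDF on already tokenized documents. The function names and the toy corpus are assumptions made for illustration; they are not part of the class material.

```python
import math

def tf(token, doc_tokens):
    # Occurrences of the token divided by the total number of tokens in the document
    return doc_tokens.count(token) / len(doc_tokens)

def idf(token, corpus_tokens):
    # Number of documents divided by the number of documents containing the token.
    # Raises ZeroDivisionError if the token appears in no document.
    n_containing = sum(1 for doc in corpus_tokens if token in doc)
    return len(corpus_tokens) / n_containing

def tf_idf(token, doc_tokens, corpus_tokens):
    return tf(token, doc_tokens) * math.log(idf(token, corpus_tokens))

# Hypothetical toy corpus of two tokenized documents
toy_corpus = [
    ["a", "kite", "is", "a", "craft"],
    ["a", "kite", "from", "china"],
]

score_craft = tf_idf("craft", toy_corpus[0], toy_corpus)  # appears in one document only
score_kite = tf_idf("kite", toy_corpus[0], toy_corpus)    # appears in every document
print(score_craft, score_kite)
```

A token that appears in every document has $IDF = 1$, so its logarithm, and therefore its TF-IDF score, is zero.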

In [1]:
!pip install -U scikit-learn

In [2]:
from nlpia.data.loaders import kite_text, kite_history
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
kite_intro = kite_text.lower()
intro_tokens = tokenizer.tokenize(kite_intro)


In [3]:
kite_history = kite_history.lower()
history_tokens = tokenizer.tokenize(kite_history)

In [4]:
print(kite_intro)
intro_total = len(intro_tokens)
print(intro_total)

a kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. a kite consists of wings, tethers, and anchors. kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. a kite's wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. a kite may have fixed or moving anchors. untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.

the lift that sustains the kite in flight is generated when air flows around the kite's surface, producing low pressure above and high pressure below the wings. the interaction with the wind also generates horizontal drag along the direction of the wind. the resultant force vector from the lift and drag force components is opposed by the tension of one or mo

In [5]:
print(kite_history)
history_total = len(history_tokens)
print(history_total)

kites were invented in china, where materials ideal for kite building were readily available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, lightweight framework.

the kite has been claimed as the invention of the 5th-century bc chinese philosophers mozi (also mo di) and lu ban (also gongshu ban). by 549 ad paper kites were certainly being flown, as it was recorded that in that year a paper kite was used as a message for a rescue mission. ancient and medieval chinese sources describe kites being used for measuring distances, testing the wind, lifting men, signaling, and communication for military operations. the earliest known chinese kites were flat (not bowed) and often rectangular. later, tailless kites incorporated a stabilizing bowline. kites were decorated with mythological motifs and legendary figures; some were fitted with strings and whistles to make musical sounds while flying. from china, kites were introd

# TF-IDF Calculation
We will compute TF-IDF for 3 words contained in these texts.
For example, the word "kite":

In [6]:
from collections import Counter

# Dictionaries to store the TF values for both documents
intro_tf = {}
history_tf = {}

intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite'] / intro_total

history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite'] / history_total

In [7]:
print(intro_tf)
print(history_tf)

{'kite': 0.0440771349862259}
{'kite': 0.020202020202020204}


Here we see that the term frequency of "kite" is higher in the intro text (about 0.044) than in the history text (about 0.020), so the word carries more weight in the intro.

In [8]:
intro_tf['and'] = intro_counts['and'] / intro_total
history_tf['and'] = history_counts['and'] / history_total
print(intro_tf)
print(history_tf)

{'kite': 0.0440771349862259, 'and': 0.027548209366391185}
{'kite': 0.020202020202020204, 'and': 0.030303030303030304}


However, we also see that in the history text "and" has a higher term frequency than "kite" (about 0.030 versus 0.020), while in the intro it is the other way around.

Next we count in how many of the two documents the words 'and', 'kite' and 'china' appear.

In [9]:
num_docs_containing_and = 0
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1
    
num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1
    
print(f'and {num_docs_containing_and}')
print(f'kite {num_docs_containing_kite}')
print(f'china {num_docs_containing_china}')

and 2
kite 2
china 1
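
The three loops above repeat the same pattern, so they could be folded into one helper. This is a sketch only: `document_frequency` and the two short token lists are hypothetical stand-ins for `intro_tokens` and `history_tokens`.

```python
def document_frequency(token, tokenized_docs):
    # Count the documents (not the occurrences) that contain the token
    return sum(1 for doc in tokenized_docs if token in doc)

# Hypothetical stand-ins for the two tokenized kite documents
toy_intro = ["a", "kite", "and", "wings"]
toy_history = ["kites", "and", "china", "kite"]

counts = {
    word: document_frequency(word, [toy_intro, toy_history])
    for word in ("and", "kite", "china")
}
print(counts)
```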


In [10]:
intro_tf['china'] = intro_counts['china'] / intro_total
history_tf['china'] = history_counts['china'] / history_total

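Now we compute the IDF values. Note that the code adds 1 to the denominator, a smoothing term that is not part of the formula given earlier.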

In [11]:
num_docs = 2
intro_idf = {}
history_idf = {}

In [12]:
intro_idf['and'] = num_docs / (num_docs_containing_and + 1)
history_idf['and'] = num_docs / (num_docs_containing_and + 1)

intro_idf['kite'] = num_docs / (num_docs_containing_kite + 1)
history_idf['kite'] = num_docs / (num_docs_containing_kite + 1)

intro_idf['china'] = num_docs / (num_docs_containing_china + 1)
history_idf['china'] = num_docs / (num_docs_containing_china + 1)

In [13]:
print(intro_idf)
print(history_idf)

{'and': 0.6666666666666666, 'kite': 0.6666666666666666, 'china': 1.0}
{'and': 0.6666666666666666, 'kite': 0.6666666666666666, 'china': 1.0}


Now we compute TF-IDF, without the natural logarithm.

In [14]:
intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and']*intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite']*intro_idf['kite']
intro_tfidf['china'] = intro_tf['china']*intro_idf['china']

In [15]:
history_tfidf = {}
history_tfidf['and'] = history_tf['and']*history_idf['and']
history_tfidf['kite'] = history_tf['kite']*history_idf['kite']
history_tfidf['china'] = history_tf['china']*history_idf['china']

In [16]:
print(intro_tfidf)
print(history_tfidf)

{'and': 0.018365472910927456, 'kite': 0.02938475665748393, 'china': 0.0}
{'and': 0.0202020202020202, 'kite': 0.01346801346801347, 'china': 0.010101010101010102}


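Computing with the logarithm. Because of the +1 in the denominator, the IDF of 'and' and 'kite' is 2/3, which is less than 1, so its logarithm is negative; the IDF of 'china' is 1, so its logarithm is 0.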

In [17]:
import math

intro_tfidf['and'] = intro_tf['and']*math.log(intro_idf['and'])
intro_tfidf['kite'] = intro_tf['kite']*math.log(intro_idf['kite'])
intro_tfidf['china'] = intro_tf['china']*math.log(intro_idf['china'])

history_tfidf['and'] = history_tf['and']*math.log(history_idf['and'])
history_tfidf['kite'] = history_tf['kite']*math.log(history_idf['kite'])
history_tfidf['china'] = history_tf['china']*math.log(history_idf['china'])

In [18]:
print(intro_tfidf)
print(history_tfidf)

{'and': -0.011169837688930151, 'kite': -0.01787174030228824, 'china': 0.0}
{'and': -0.012286821457823165, 'kite': -0.008191214305215444, 'china': 0.0}


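Why do we take the natural logarithm? Because of Zipf's law.
In a sufficiently large corpus, the frequency of a word is roughly inversely proportional to its rank: the most frequent word appears about twice as often as the second most frequent, about three times as often as the third, and so on. Word counts therefore span several orders of magnitude (strictly a power-law relationship rather than an exponential one), and taking the logarithm compresses that range.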

In [19]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
len(brown.words())

[nltk_data] Downloading package brown to
[nltk_data]     D:\Users\Memo\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


1161192

A fairly large corpus!
Here we can see Zipf's law:

In [20]:
from collections import Counter
puncs = set((',', '.', '--', '-', '!', '?', ':', ';', '``', "''", '(', ')', '[', ']'))
word_list = (x.lower() for x in brown.words() if x not in puncs)
token_counts = Counter(word_list)
token_counts.most_common(20)

[('the', 69971),
 ('of', 36412),
 ('and', 28853),
 ('to', 26158),
 ('a', 23195),
 ('in', 21337),
 ('that', 10594),
 ('is', 10109),
 ('was', 9815),
 ('he', 9548),
 ('for', 9489),
 ('it', 8760),
 ('with', 7289),
 ('as', 7253),
 ('his', 6996),
 ('on', 6741),
 ('be', 6377),
 ('at', 5372),
 ('by', 5306),
 ('i', 5164)]
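
To check how closely these counts follow Zipf's law, one can compare each count with the idealized prediction `top_count / rank`. The sketch below hard-codes the first five counts from the output above; the variable names are illustrative assumptions.

```python
# First five (word, count) pairs copied from the Brown corpus output above
top_counts = [("the", 69971), ("of", 36412), ("and", 28853), ("to", 26158), ("a", 23195)]

top_freq = top_counts[0][1]
ratios = []
for rank, (word, count) in enumerate(top_counts, start=1):
    predicted = top_freq / rank          # idealized Zipf prediction for this rank
    ratios.append(count / predicted)     # 1.0 means a perfect match
    print(f"{rank} {word:>4} observed={count} predicted={predicted:.0f}")
```

The second-ranked word is close to the prediction, while the following ranks drift above it, so the law holds only approximately for this corpus.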

Using the same corpus as in the previous class:

In [23]:
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter

tokenizer = TreebankWordTokenizer()
docs = ["La UFRO está en Temuco, y yo estudio en la ufro."]
docs.append("La Ufro es una universidad estatal.")
docs.append("Facultad de Ingeniería y Ciencia, Ufro.")
print(docs)

doc_tokens = []
tokens_in_doc = []
num_tokens_in_doc = []
for doc in docs:
    tokens = tokenizer.tokenize(doc.lower())
    doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
    token_counts = Counter(tokens)
    tokens_in_doc.append(tokens)
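    # NOTE: len(token_counts) is the number of distinct tokens in the document;
    # len(tokens) would be the total token count that the TF formula uses.
    num_tokens_in_doc.append(len(token_counts))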

# All tokens from all docs
all_doc_tokens = sum(doc_tokens, [])

# Lexicon
lexicon = sorted(set(all_doc_tokens))

from collections import OrderedDict

zero_vector = OrderedDict((token, 0) for token in lexicon)

['La UFRO está en Temuco, y yo estudio en la ufro.', 'La Ufro es una universidad estatal.', 'Facultad de Ingeniería y Ciencia, Ufro.']


We build a dictionary that records in how many documents each token appears, which we need to compute the IDF.

In [26]:
num_documents_containing_token = {}
for token in lexicon:
    containing_token = 0
    for i, doc in enumerate(docs):
        if token in tokens_in_doc[i]:
            containing_token += 1
    num_documents_containing_token.update({token : containing_token})
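
The same document-frequency dictionary can be sketched more compactly with `collections.Counter`, counting each token at most once per document. The token lists below are a shortened, hypothetical version of the corpus.

```python
from collections import Counter

# Hypothetical, shortened tokenized documents
toy_tokens_in_doc = [
    ["la", "ufro", "está", "en", "temuco", "en", "la", "ufro"],
    ["la", "ufro", "es", "una", "universidad"],
    ["facultad", "de", "ingeniería", "ufro"],
]

doc_freq = Counter()
for tokens in toy_tokens_in_doc:
    # set() makes repeated tokens within one document count only once
    doc_freq.update(set(tokens))

print(doc_freq["ufro"], doc_freq["la"], doc_freq["temuco"])
```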

In [27]:
import copy
import math

document_tfidf_vectors = []
for i, doc in enumerate(docs):
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        tf = value / num_tokens_in_doc[i]
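        # NOTE: `token` is left over from the loop in the previous cell, so every key
        # receives the same IDF; indexing with `key` would follow the formula.
        # The printed vectors reflect the code as written.
        idf = len(docs) / num_documents_containing_token[token]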
        vec[key] = tf * math.log(idf)
    document_tfidf_vectors.append(vec)
    
print(document_tfidf_vectors)

[OrderedDict([(',', 0.10986122886681099), ('.', 0.10986122886681099), ('ciencia', 0), ('de', 0), ('en', 0.21972245773362198), ('es', 0), ('estatal', 0), ('estudio', 0.10986122886681099), ('está', 0.10986122886681099), ('facultad', 0), ('ingeniería', 0), ('la', 0.21972245773362198), ('temuco', 0.10986122886681099), ('ufro', 0.21972245773362198), ('una', 0), ('universidad', 0), ('y', 0.10986122886681099), ('yo', 0.10986122886681099)]), OrderedDict([(',', 0), ('.', 0.15694461266687282), ('ciencia', 0), ('de', 0), ('en', 0), ('es', 0.15694461266687282), ('estatal', 0.15694461266687282), ('estudio', 0), ('está', 0), ('facultad', 0), ('ingeniería', 0), ('la', 0.15694461266687282), ('temuco', 0), ('ufro', 0.15694461266687282), ('una', 0.15694461266687282), ('universidad', 0.15694461266687282), ('y', 0), ('yo', 0)]), OrderedDict([(',', 0.13732653608351372), ('.', 0.13732653608351372), ('ciencia', 0.13732653608351372), ('de', 0.13732653608351372), ('en', 0), ('es', 0), ('estatal', 0), ('estudio

In [29]:
print(document_tfidf_vectors[:2])

[OrderedDict([(',', 0.10986122886681099), ('.', 0.10986122886681099), ('ciencia', 0), ('de', 0), ('en', 0.21972245773362198), ('es', 0), ('estatal', 0), ('estudio', 0.10986122886681099), ('está', 0.10986122886681099), ('facultad', 0), ('ingeniería', 0), ('la', 0.21972245773362198), ('temuco', 0.10986122886681099), ('ufro', 0.21972245773362198), ('una', 0), ('universidad', 0), ('y', 0.10986122886681099), ('yo', 0.10986122886681099)]), OrderedDict([(',', 0), ('.', 0.15694461266687282), ('ciencia', 0), ('de', 0), ('en', 0), ('es', 0.15694461266687282), ('estatal', 0.15694461266687282), ('estudio', 0), ('está', 0), ('facultad', 0), ('ingeniería', 0), ('la', 0.15694461266687282), ('temuco', 0), ('ufro', 0.15694461266687282), ('una', 0.15694461266687282), ('universidad', 0.15694461266687282), ('y', 0), ('yo', 0)])]
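
For comparison, here is a sketch that follows the TF-IDF formula term by term: TF divides by the total number of tokens in the document, and the IDF is looked up for the token being weighted. It uses a simple regular expression instead of `TreebankWordTokenizer` so that it runs without NLTK; that substitution and all variable names are assumptions.

```python
import math
import re
from collections import Counter

docs = [
    "La UFRO está en Temuco, y yo estudio en la ufro.",
    "La Ufro es una universidad estatal.",
    "Facultad de Ingeniería y Ciencia, Ufro.",
]

# Assumed tokenizer: runs of word characters, or single punctuation marks
tokenized = [re.findall(r"\w+|[^\w\s]", d.lower()) for d in docs]
lexicon = sorted(set(t for toks in tokenized for t in toks))
doc_freq = Counter(t for toks in tokenized for t in set(toks))

vectors = []
for toks in tokenized:
    vec = {t: 0.0 for t in lexicon}
    for t, count in Counter(toks).items():
        tf = count / len(toks)                # total tokens, not distinct tokens
        idf = len(tokenized) / doc_freq[t]    # IDF of this particular token
        vec[t] = tf * math.log(idf)
    vectors.append(vec)

print(vectors[0]["ufro"], vectors[0]["en"])
```

With this version, 'ufro' and '.' appear in all three documents and therefore score 0, so the resulting vectors differ from the ones printed above.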


## Cosine similarity calculation

In [30]:
import math

def sim_coseno(vec1, vec2):
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]
    
    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v*vec2[i]
        
    norm_1 = math.sqrt(sum([x**2 for x in vec1]))
    norm_2 = math.sqrt(sum([x**2 for x in vec2]))
    
    return dot_prod / (norm_1 * norm_2)

In [32]:
sim_coseno(document_tfidf_vectors[0], document_tfidf_vectors[1])

0.4335549847620599
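
The function above divides by zero if either vector contains only zeros. A possible variant that works on plain lists and guards against that case is sketched below; returning 0.0 for a zero vector is a convention chosen here, not something the class material prescribes.

```python
import math

def cosine_similarity(vec1, vec2):
    # Dot product over paired components
    dot_prod = sum(a * b for a, b in zip(vec1, vec2))
    norm_1 = math.sqrt(sum(a * a for a in vec1))
    norm_2 = math.sqrt(sum(b * b for b in vec2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot_prod / (norm_1 * norm_2)

print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))   # parallel vectors
```

To use it with the `OrderedDict` vectors from this class, pass `list(vec.values())` for each vector.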

In [34]:
import numpy as np

similitud_tfidf = [[sim_coseno(doc1, doc2) for doc2 in document_tfidf_vectors] for doc1 in document_tfidf_vectors]
similitud_tfidf = np.array(similitud_tfidf)
np.fill_diagonal(similitud_tfidf, 0)
similitud_tfidf


array([[0.        , 0.43355498, 0.40555355],
       [0.43355498, 0.        , 0.26726124],
       [0.40555355, 0.26726124, 0.        ]])

### The sklearn library
The scikit-learn library implements all of this in roughly four lines:

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = docs
vectorizer = TfidfVectorizer(min_df=1, encoding='utf-8')
model = vectorizer.fit_transform(corpus)

In [36]:
print(model.todense())

[[0.         0.         0.5844829  0.         0.         0.29224145
  0.29224145 0.         0.         0.44451431 0.29224145 0.34520502
  0.         0.         0.29224145]
 [0.         0.         0.         0.45050407 0.45050407 0.
  0.         0.         0.         0.34261996 0.         0.26607496
  0.45050407 0.45050407 0.        ]
 [0.47952794 0.47952794 0.         0.         0.         0.
  0.         0.47952794 0.47952794 0.         0.         0.28321692
  0.         0.         0.        ]]
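
These values differ from the manual ones because `TfidfVectorizer` uses different defaults: its token pattern keeps only tokens of two or more word characters (so 'y' and the punctuation disappear), the IDF is smoothed as $\ln\frac{1+n}{1+df}+1$, the term frequency is the raw count, and each row is scaled to unit L2 norm. The sketch below reproduces the matrix in plain Python under those defaults; the variable names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    "La UFRO está en Temuco, y yo estudio en la ufro.",
    "La Ufro es una universidad estatal.",
    "Facultad de Ingeniería y Ciencia, Ufro.",
]

# Same effect as the vectorizer's default token pattern: 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(tokenized)
doc_freq = Counter(t for toks in tokenized for t in set(toks))
vocab = sorted(doc_freq)

rows = []
for toks in tokenized:
    counts = Counter(toks)
    # Raw count times smoothed IDF
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    rows.append([x / norm for x in raw])   # L2 normalization per document

print(vocab)
print([round(x, 8) for x in rows[0]])
```

Because of the `+ 1` in the smoothed IDF, a token that appears in every document, such as 'ufro', keeps a non-zero weight.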
