# Representing Text

We can represent text in many ways: character strings are a standard representation, but we can also create numerical representations of text. In this and the next few notebooks, we will explore and discuss a few of these representations to motivate our discussions of *embeddings*. Embeddings are a representation of text that will help us determine similarity between two blurbs (phrases, sentences, paragraphs, etc.) of text.

<div>
<img src="img/02_text_representation_distance.png" width="600"/>
</div>

Image source: Mastering Text Similarity ([Guadagnolo, 2024](https://medium.com/eni-digitalks/mastering-text-similarity-combining-embedding-techniques-and-distance-metrics-98d3bb80b1b6))

## Vectorization

The process of converting text into numerical vectors is sometimes called vectorization.

A simple form of vectorization is to count how many times each word appears in a phrase. [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from scikit-learn helps achieve this:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np


phrases = ["cats are fun", "dogs are also fun", "ice cream is great"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(phrases)
print(x.toarray())


[[0 1 1 0 0 1 0 0 0]
 [1 1 0 0 1 1 0 0 0]
 [0 0 0 1 0 0 1 1 1]]


Adding column labels via a pandas DataFrame makes the operation easier to understand: each column is a word in the vocabulary, and each row is a phrase.

In [2]:
import pandas as pd
df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
print(df)

   also  are  cats  cream  dogs  fun  great  ice  is
0     0    1     1      0     0    1      0    0   0
1     1    1     0      0     1    1      0    0   0
2     0    0     0      1     0    0      1    1   1


## Why does it matter?

Using count vectorization, we can calculate the vectors' cosine similarity.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(x)
print(similarity)

[[1.         0.57735027 0.        ]
 [0.57735027 1.         0.        ]
 [0.         0.         1.        ]]


The cosine similarity between two vectors is the dot product normalized by the norms of each vector (see, for example, [this discussion](https://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html) and [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)). This means:

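+ The phrase `"cats are fun"` is represented by the vector `[0    1     1     0    1]` over the columns `also`, `are`, `cats`, `dogs`, `fun`. The four columns that are zero for both phrases (`cream`, `great`, `ice`, `is`) are omitted because they do not change the result.
+ The phrase `"dogs are also fun"` is represented by       `[1    1     0     1    1]` over the same columns.
+ The dot product of these vectors is the sum of the pair-wise product of their elements: `0*1 + 1*1 + 1*0 + 0*1 + 1*1 = 2`.
+ The norm of each vector is the usual Euclidean norm: `sqrt(0^2 + 1^2 + 1^2 + 0^2  + 1^2) = sqrt(3)` and `sqrt(1^2 + 1^2 + 0^2 + 1^2  + 1^2) = 2`, respectively.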

In [4]:
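# Dot product over the columns also, are, cats, dogs, fun
d = 0*1 + 1*1 + 1*0 + 0*1 + 1*1
# Euclidean norms; named so they do not overwrite the count matrix `x`
norm_cats = np.sqrt(0**2 + 1**2 + 1**2 + 0**2  + 1**2)
norm_dogs = np.sqrt(1**2 + 1**2 + 0**2 + 1**2  + 1**2)
d/(norm_cats*norm_dogs)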

np.float64(0.5773502691896258)
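The same arithmetic can be wrapped in a small helper. This is a minimal sketch, not how scikit-learn implements it: `cosine_sim` is a hypothetical name, and the three vectors are the full nine-column rows copied from the count table above. The helper assumes neither vector is all zeros, since that would mean dividing by zero.

```python
import numpy as np

def cosine_sim(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows of the count table: also, are, cats, cream, dogs, fun, great, ice, is
cats = [0, 1, 1, 0, 0, 1, 0, 0, 0]   # "cats are fun"
dogs = [1, 1, 0, 0, 1, 1, 0, 0, 0]   # "dogs are also fun"
ice  = [0, 0, 0, 1, 0, 0, 1, 1, 1]   # "ice cream is great"

print(cosine_sim(cats, dogs))  # approximately 0.577
print(cosine_sim(cats, ice))   # 0.0, because the phrases share no words
```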

Using this simple method, we obtain a metric that approaches 1 as two phrases share a larger fraction of their words and equals 0 when they share none. Count vectors have no negative entries, so the value always lies between 0 and 1. When using `CountVectorizer`, every word receives the same weight, regardless of its relative importance in the corpus (the group of documents or phrases).

## tf-idf Vectorization

We can enhance the similarity metric by counting better: we want to give more weight to words that are rare in the corpus. This reduces the relative importance of very common words (e.g., "the", "a", "is"), which often carry little meaning.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np



vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(phrases)
print(x.toarray())


[[0.         0.51785612 0.68091856 0.         0.         0.51785612
  0.         0.         0.        ]
 [0.5628291  0.42804604 0.         0.         0.5628291  0.42804604
  0.         0.         0.        ]
 [0.         0.         0.         0.5        0.         0.
  0.5        0.5        0.5       ]]


From [sklearn's documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting):

>In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
>
>In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

tf-idf is the product of term frequency (tf) and inverse document frequency (idf):

+ Term frequency is the number of times that a token (a word in the example above) appears in a document.
+ Inverse document frequency, with scikit-learn's default smoothing, is given by

$$
\mathrm{idf}(t) = \log \frac{1+n}{1+\mathrm{df}(t)} + 1,
$$

  where $n$ is the total number of documents, $\mathrm{df}(t)$ is the number of documents in the document set that contain the term $t$, and $\log$ is the natural logarithm.
+ Each resulting tf-idf vector is divided by its Euclidean norm, so every row has unit length.
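To make the weighting concrete, the sketch below recomputes the tf-idf row for `"cats are fun"` by hand. It assumes the defaults of `TfidfVectorizer` (smoothed idf of the form `log((1 + n) / (1 + df)) + 1` with the natural logarithm, followed by Euclidean normalization). The names `doc_freq`, `term_freq`, and `tfidf` are illustrative, and the document frequencies are read off the count table earlier in this notebook.

```python
import numpy as np

n = 3  # number of phrases in the corpus

# Words of "cats are fun" in column order: are, cats, fun
doc_freq = np.array([2, 1, 2])   # "are" and "fun" occur in 2 phrases, "cats" in 1
term_freq = np.array([1, 1, 1])  # each word occurs once in this phrase

idf = np.log((1 + n) / (1 + doc_freq)) + 1   # smoothed idf
tfidf = term_freq * idf
tfidf = tfidf / np.linalg.norm(tfidf)        # scale to unit Euclidean norm

print(tfidf)  # approximately [0.5179 0.6809 0.5179]
```

These values match the nonzero entries in the first row of the `TfidfVectorizer` output: the rarer word `cats` ends up with a larger weight than `are` and `fun`.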

In [6]:
similarity = cosine_similarity(x)
print(similarity)

[[1.         0.44333251 0.        ]
 [0.44333251 1.         0.        ]
 [0.         0.         1.        ]]
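The similarity between the first two phrases drops from about 0.577 with raw counts to about 0.443 with tf-idf. A brief sketch of why, using the tf-idf rows copied from the printed output above: the rows already have unit norm, so the cosine similarity reduces to a plain dot product, and only the shared words `are` and `fun`, which now carry lower weights than the distinguishing words, contribute to it. The variable names are illustrative.

```python
import numpy as np

# tf-idf rows copied from the output above
# columns: also, are, cats, cream, dogs, fun, great, ice, is
cats_row = np.array([0, 0.51785612, 0.68091856, 0, 0, 0.51785612, 0, 0, 0])
dogs_row = np.array([0.5628291, 0.42804604, 0, 0, 0.5628291, 0.42804604, 0, 0, 0])

# Both rows have (approximately) unit length ...
print(np.linalg.norm(cats_row), np.linalg.norm(dogs_row))

# ... so the dot product is already the cosine similarity.
dot = cats_row @ dogs_row
print(dot)  # approximately 0.4433
```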
