# Working with text

Welcome to this chapter of your machine learning book with scikit-learn.

You've probably heard of something called ChatGPT. Well, its foundations lie here: in text analysis.

Text analysis is a topic that probably deserves its own book, but we can start laying the foundations using scikit-learn.

First, it's important to remember that written words are not directly consumable by machine learning models; we need to convert them into a numerical representation that we can process.

## Bag of words – CountVectorizer

One of the most common ways to do this is by using the bag of words (BoW) technique, which converts text into a word frequency matrix.

To accomplish this, scikit-learn offers us a utility that performs the following for us:

 - Splits the text into tokens, generally complete words,
 - Counts the occurrences of each of these tokens,
 - Assigns values within a vector according to the number of occurrences of each token in our input data.

This is done through the vectorizer known as `CountVectorizer`. To see it in action, we first need to generate a dataset. By the way, a text dataset is often called a corpus, and each of its individual elements is known as a document.
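To make those three steps concrete, here is a minimal pure-Python sketch of a bag of words. It is not how scikit-learn implements `CountVectorizer`; the helper `bag_of_words`, the regular expression (which only loosely mirrors the default tokenization), and the toy documents are assumptions chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Hypothetical helper: tokenize, build a vocabulary, and count tokens."""
    # Step 1: split each document into lowercase tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

    # Step 2: assign one column index to each distinct token, in alphabetical order
    distinct_tokens = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {token: index for index, token in enumerate(distinct_tokens)}

    # Step 3: one row per document, one count per vocabulary entry
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts.get(token, 0) for token in vocabulary])
    return vocabulary, matrix


toy_documents = ["the cat sat", "the cat saw the dog"]
toy_vocabulary, toy_matrix = bag_of_words(toy_documents)

print(toy_vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(toy_matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is a document and each column a token, which is the same layout `CountVectorizer` produces, only stored as plain lists instead of a sparse matrix.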

So let's generate a corpus with three documents:

In [None]:
corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens"
]


We import the vectorizer – note that we are importing from `sklearn.feature_extraction.text`:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


We create an object with default values – and train it with our corpus:

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(corpus)


If we call the `transform` method with our corpus, the result is a sparse matrix, since that is the most memory-efficient representation when most entries are zero. We can convert it to a dense matrix using its `todense` method (or to a NumPy array with `toarray`).

In [None]:
transformed_corpus = count_vectorizer.transform(corpus)
transformed_corpus.todense()


Yes, it's a matrix with a lot of zeros. If you want to see which word corresponds to each column, you can access the fitted attribute `vocabulary_`. There you can see that the word "ayuda" corresponds to column zero and the word "uno" corresponds to column number 20.
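As a quick way to check that mapping, the sketch below fits a fresh vectorizer on the same three documents and lists the vocabulary in column order. The variable names (`vocab_vectorizer`, `columns`, `de_count`) are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three documents used in this chapter
corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens",
]

vocab_vectorizer = CountVectorizer().fit(corpus)

# vocabulary_ maps each token to its column index; sort by index to read it in column order
columns = sorted(vocab_vectorizer.vocabulary_.items(), key=lambda item: item[1])
for token, column in columns:
    print(column, token)

# Look up a single count: "de" appears twice in the third document
counts = vocab_vectorizer.transform(corpus).toarray()
de_count = counts[2, vocab_vectorizer.vocabulary_["de"]]
print(de_count)  # 2
```

Note that the single-letter word "a" does not appear in the vocabulary: with the default settings, tokens shorter than two characters are ignored.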

## Inverse transformation

`CountVectorizer` offers the `inverse_transform` method, which recovers, from a matrix of vectors, the tokens present in each input document. Be careful: although it is called an inverse transformation, it cannot rebuild the original text, because one of the disadvantages of this family of vectorizers is that word order is lost. You only get back the vocabulary tokens that have a nonzero entry in each row.

We can test it by calling the method with the matrix we just obtained:

In [None]:
count_vectorizer.inverse_transform(transformed_corpus)
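
To see why the inverse is lossy, consider this small sketch with two sentences that mean opposite things but contain the same words. The sentences and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents with the same words in a different order
opposite_sentences = ["the dog bites the man", "the man bites the dog"]

order_vectorizer = CountVectorizer()
order_matrix = order_vectorizer.fit_transform(opposite_sentences)

# Both documents end up with exactly the same count vector
dense = order_matrix.toarray()
same_vectors = (dense[0] == dense[1]).all()
print(same_vectors)

# inverse_transform returns the tokens present in each row, not the sentence
recovered = order_vectorizer.inverse_transform(order_matrix)
recovered_sets = [set(tokens) for tokens in recovered]
print(recovered_sets)
```

Because both rows are identical, nothing in the matrix lets us tell which sentence was which. The recovered tokens are compared as sets here because their order carries no meaning.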


## Extra Parameters

The `CountVectorizer` is one of the first classes that has a large number of parameters to configure its behavior. Among the most common ones I've seen used are:

 - `binary`, which by default has a value of `False`. When this value is true, the `CountVectorizer` behaves more like a one-hot encoder, and our resulting matrix consists of ones and zeros.
 - `max_features`, this can be a number that indicates the maximum number of columns we want in our matrix. In the previous example, we had a matrix of 21 columns as a result, but if we had set `max_features` to a value of 10, we would have a matrix of 10 columns as a result, where those ten columns would contain the 10 most frequent tokens.
 - `max_df` and `min_df`, these parameters allow us to eliminate words that are overrepresented and underrepresented in our corpus, measured by the number of documents in which they appear. These values can be floats between 0 and 1 if we want to use them as a proportion of documents, or integers if we want to use an absolute number of documents.

For example, if we create a vectorizer with the following arguments:

In [None]:
modified_count_vectorizer = CountVectorizer(
    binary=True,
    max_features=6,
    min_df=1,
)


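You will see that transforming our corpus now yields a matrix of 6 columns filled only with ones and zeros, and that the vocabulary contains just six tokens, chosen among the most frequent ones. In a corpus this small most tokens are tied in frequency, so which of the tied tokens are kept is an implementation detail: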

In [None]:
new_result = modified_count_vectorizer.fit_transform(corpus)
print(modified_count_vectorizer.vocabulary_)
new_result.todense()
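
The example above leaves `min_df` at 1, which is its default value, so it does not filter anything. The following sketch shows `max_df` and `min_df` actually removing tokens. The toy corpus and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "the" is in every document, "sleeps" is in two of them
toy_corpus = ["the cat sleeps", "the dog sleeps", "the bird sings"]

# max_df as a proportion: drop tokens found in more than 90% of the documents
no_common = CountVectorizer(max_df=0.9).fit(toy_corpus)
print(sorted(no_common.vocabulary_))  # ['bird', 'cat', 'dog', 'sings', 'sleeps']

# min_df as an integer: keep only tokens found in at least 2 documents
no_rare = CountVectorizer(min_df=2).fit(toy_corpus)
print(sorted(no_rare.vocabulary_))  # ['sleeps', 'the']
```

In both cases the threshold is compared against the number of documents containing the token, not against the total number of times the token occurs.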


## Changing the tokenizer

There are times when we want more control over how the documents in our corpus are broken down into tokens, for example when we are dealing with another language or with text that contains emojis.

In [None]:
import re

# A tokenizer that keeps only the emojis
def emoji_tokenizer(text):
    emojis = re.findall(r'[\U0001F000-\U0001F6FF]', text)
    return emojis

print(emoji_tokenizer("I 💚 🍕"))


We use the tokenizer by passing it to `CountVectorizer` in the `tokenizer` argument:

In [None]:
emoji_corpus = [
    "I 💚 🍕",
    "This 🍕 was 👎",
    "I like either 🍕 or 🍔, but not 🌭",
]

emoji_vectorizer = CountVectorizer(tokenizer=emoji_tokenizer)
X = emoji_vectorizer.fit_transform(emoji_corpus)

# Print the vocabulary (token -> column index) and the count matrix
print(emoji_vectorizer.vocabulary_)
print(X.toarray())


## TF-IDF Weighting – TfidfVectorizer

In a large text corpus, some words will be heavily overrepresented (such as "the", "a", "is") and therefore carry little meaningful information about the actual content of a document. If a word appears in every document, it does not help us tell the documents apart.

If we take the results of a `CountVectorizer` in these cases, we run the risk of passing unnecessary information to our model, thus obscuring less frequent, rarer, and more interesting words. To address this problem, we can use text feature extraction techniques that weight words based on their relative importance in the corpus.

A common technique for text feature extraction is term frequency-inverse document frequency (TF-IDF), which measures the relative importance of a word in a document by weighing how often the word appears in that document against how common it is in the corpus as a whole.

Scikit-learn offers us a class called `TfidfVectorizer` that has the same external behavior as the `CountVectorizer`: it receives a set of texts and gives us a corresponding matrix. It also has almost the same arguments.
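As a small illustration of that weighting, the sketch below fits a `TfidfVectorizer` with default settings on a made-up corpus in which "the" appears in every document. With the defaults, scikit-learn uses a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing the token, and then normalizes each row. The corpus and variable names are assumptions.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "the" appears everywhere, every other word appears once
tfidf_corpus = ["the cat sleeps", "the dog barks", "the bird sings"]

tfidf_vectorizer = TfidfVectorizer()
weights = tfidf_vectorizer.fit_transform(tfidf_corpus).toarray()
vocab = tfidf_vectorizer.vocabulary_

# idf_ holds one learned inverse document frequency per column
idf_the = tfidf_vectorizer.idf_[vocab["the"]]  # ln(4 / 4) + 1
idf_cat = tfidf_vectorizer.idf_[vocab["cat"]]  # ln(4 / 2) + 1
print(idf_the, idf_cat)

# In the first document both words occur once, yet "cat" gets the larger weight
weight_the = weights[0, vocab["the"]]
weight_cat = weights[0, vocab["cat"]]
print(weight_the < weight_cat)  # True
```

A plain `CountVectorizer` would give both words a count of 1 in that document, which is exactly the difference TF-IDF is meant to introduce.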

```{hint}
As homework, try the examples we saw, but using `TfidfVectorizer` instead. What differences do you notice? Tell me in the comments.
```

## Feature hashing

Another way to convert text into a numerical representation is the *hashing trick*, which we saw recently. For this case, scikit-learn has another class for us: `HashingVectorizer`, which shares many characteristics with the two vectorizers seen previously. It has the same limitations we already know, but it can be a good alternative in some scenarios. Remember that with *feature hashing* it is impossible to return to the original values.
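A brief sketch of what that looks like in practice, assuming a deliberately tiny `n_features=16` (chosen only to keep the output readable) and made-up documents:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical documents for illustration
hash_corpus = ["the cat sleeps", "the dog barks"]

# The number of columns is fixed up front instead of being learned from the data
hashing_vectorizer = HashingVectorizer(n_features=16)

# The vectorizer is stateless, so transform can be called without fitting first
hashed = hashing_vectorizer.transform(hash_corpus)
print(hashed.shape)  # (2, 16)

# No vocabulary is stored, so there is no way to map columns back to tokens
has_vocabulary = hasattr(hashing_vectorizer, "vocabulary_")
print(has_vocabulary)  # False
```

Since nothing is learned from the corpus, the same vectorizer can process documents in batches, at the cost of possible collisions when two tokens hash to the same column.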

```{hint}
More homework: try the examples we saw, but using `HashingVectorizer` instead. What differences do you notice? Tell me in the comments.
```

## Conclusion

We've seen multiple ways to transform our text into numbers: `CountVectorizer` in count mode or binary mode, `TfidfVectorizer`, and `HashingVectorizer`. The method you should use will depend on your use case, but you can follow these general rules:

 - **`CountVectorizer`** is useful for creating a word count matrix when the absolute number of occurrences of each word in the text is important.
 - **`TfidfVectorizer`** is useful when you want to weight the importance of each word in a document by how common it is across the corpus.
 - **`HashingVectorizer`** is useful for working with very large datasets that don't fit in memory and for keeping the dimensionality of the feature space bounded.

I hope it's now clearer what the first step is before feeding text into your machine learning models. Remember that you need to practice; in the resources you'll find practical examples where some of these vectorizers are used.

In the following chapters we'll see how we can work with categorical and numerical variables. I'll see you there.