# Text Analysis

Text must be converted into a numerical representation before it can be used in ML models.

**Bag of words (BoW)**
Represents text as a word-frequency matrix.
Scikit-learn's vectorizers:
* Split the text into tokens (usually whole words)
* Count the occurrences of each token
* Assign values in a vector based on those counts
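
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper name `bag_of_words` and the two English sentences are made up for the example, and the regular expression is an assumption that approximates the default behaviour of keeping lowercase tokens of two or more characters.

```python
import re
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Step 1: split each text into lowercase tokens of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]

    # Assign one column index per distinct token, in alphabetical order
    distinct_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocabulary = {tok: i for i, tok in enumerate(distinct_tokens)}

    # Steps 2 and 3: count occurrences and place them in a vector per document
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one token, which is the same layout the scikit-learn vectorizers produce (as a sparse matrix rather than nested lists).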

**Conclusion**
* Use **CountVectorizer** when the absolute number of occurrences of each word in the text is important
* Use **TfidfVectorizer** when each word should be weighted by how often it appears across the corpus, so that words common to many documents count for less
* Use **HashingVectorizer** for very large data sets that do not fit in memory, because it is stateless and stores no vocabulary

In [1]:
corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens"
]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(corpus)

In [5]:
transformed_corpus = count_vectorizer.transform(corpus)

In [6]:
transformed_corpus

<3x21 sparse matrix of type '<class 'numpy.int64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [7]:
transformed_corpus.todense()

matrix([[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])

In [8]:
count_vectorizer.vocabulary_

{'scikit': 16,
 'learn': 11,
 'nos': 12,
 'ayuda': 0,
 'trabajar': 19,
 'con': 3,
 'texto': 17,
 'parte': 15,
 'el': 6,
 'en': 7,
 'tokens': 18,
 'generalmente': 9,
 'palabras': 14,
 'completas': 2,
 'cuenta': 4,
 'las': 10,
 'ocurrencias': 13,
 'de': 5,
 'cada': 1,
 'uno': 20,
 'estos': 8}


You can use inverse_transform to recover the tokens from a matrix of vectors, but be careful: word order is lost during the transformation, so you only get back the set of tokens present in each document.

In [9]:
count_vectorizer.inverse_transform(transformed_corpus)

[array(['ayuda', 'con', 'learn', 'nos', 'scikit', 'texto', 'trabajar'],
       dtype='<U12'),
 array(['completas', 'el', 'en', 'generalmente', 'palabras', 'parte',
        'texto', 'tokens'], dtype='<U12'),
 array(['cada', 'cuenta', 'de', 'estos', 'las', 'ocurrencias', 'tokens',
        'uno'], dtype='<U12')]
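
To make the loss of word order concrete, here is a small sketch in which two sentences with opposite meanings use exactly the same words. The two English sentences are invented for this example; the calls used (`fit_transform`, `toarray`, `inverse_transform`) are the same ones shown in the cells above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, opposite meaning (illustrative sentences)
docs = ["the dog bites the man", "the man bites the dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Both documents map to the same count vector
same_vector = bool((X[0] == X[1]).all())
print(same_vector)

# inverse_transform can only report which tokens are present, not their order
recovered = [sorted(str(tok) for tok in tokens)
             for tokens in vectorizer.inverse_transform(X)]
print(recovered)
```

Because both rows are identical, no model trained on these vectors can tell the two sentences apart.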

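The vectorizer can also be configured: binary=True records only whether a token is present (1) or absent (0) instead of counting it, and max_features=10 keeps only the 10 most frequent tokens in the corpus.

In [10]: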
modified_count_vectorizer = CountVectorizer(
    binary=True,
    max_features=10
)
modified_count_vectorizer.fit(corpus)
modified_count_vectorizer.transform(corpus).todense()

matrix([[1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
        [0, 0, 0, 0, 1, 1, 0, 1, 1, 0],
        [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]])

---
You can change how the text is split into tokens by passing a custom tokenizer function.

In [14]:
import re


def emoji_tokenizer(text: str) -> list:
    emojis = re.findall(r'[\U0001F000-\U0001F6FF]', text)
    return emojis

print(emoji_tokenizer("I 💙 🍕"))

['💙', '🍕']


In [15]:
emoji_vectorizer = CountVectorizer(tokenizer=emoji_tokenizer)

In [16]:
emoji_corpus = [
    "I 💙 🍕",
    "This 🍕 was 👎",
    "I like either 🍕 or 🍔, but not 🌭",
]

In [17]:
X = emoji_vectorizer.fit_transform(emoji_corpus)



In [18]:
# Print the token-to-column mapping and the count matrix
print(emoji_vectorizer.vocabulary_)
print(X.toarray())

{'💙': 4, '🍕': 2, '👎': 3, '🍔': 1, '🌭': 0}
[[0 0 1 0 1]
 [0 0 1 1 0]
 [1 1 1 0 0]]


---
You can weight each word by its relative importance: TfidfVectorizer scales the count of a word within a document down when that word appears in many documents of the corpus.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens"
]

In [26]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(corpus)

In [27]:
transformed_corpus = tfidf_vectorizer.transform(corpus)

In [28]:
transformed_corpus

<3x21 sparse matrix of type '<class 'numpy.float64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [29]:
transformed_corpus.todense()

matrix([[0.38988801, 0.        , 0.        , 0.38988801, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.38988801, 0.38988801, 0.        , 0.        ,
         0.        , 0.38988801, 0.29651988, 0.        , 0.38988801,
         0.        ],
        [0.        , 0.        , 0.37380112, 0.        , 0.        ,
         0.        , 0.37380112, 0.37380112, 0.        , 0.37380112,
         0.        , 0.        , 0.        , 0.        , 0.37380112,
         0.37380112, 0.        , 0.28428538, 0.28428538, 0.        ,
         0.        ],
        [0.        , 0.30746099, 0.        , 0.        , 0.30746099,
         0.61492198, 0.        , 0.        , 0.30746099, 0.        ,
         0.30746099, 0.        , 0.        , 0.30746099, 0.        ,
         0.        , 0.        , 0.        , 0.23383201, 0.        ,
         0.30746099]])

In [40]:
tfidf_vectorizer.vocabulary_

{'scikit': 16,
 'learn': 11,
 'nos': 12,
 'ayuda': 0,
 'trabajar': 19,
 'con': 3,
 'texto': 17,
 'parte': 15,
 'el': 6,
 'en': 7,
 'tokens': 18,
 'generalmente': 9,
 'palabras': 14,
 'completas': 2,
 'cuenta': 4,
 'las': 10,
 'ocurrencias': 13,
 'de': 5,
 'cada': 1,
 'uno': 20,
 'estos': 8}
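
The weights in the matrix above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses only the standard library; the helper names `tokenize`, `idf` and `tfidf_row` are made up for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens",
]


def tokenize(text):
    # Lowercase tokens of 2+ word characters (so "a" is dropped)
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(text)) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each token appears
df = Counter(tok for doc in docs for tok in doc)


def idf(tok):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[tok])) + 1


def tfidf_row(counts):
    # Term count times IDF, then scale the row to unit Euclidean length
    raw = {tok: count * idf(tok) for tok, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {tok: value / norm for tok, value in raw.items()}


rows = [tfidf_row(doc) for doc in docs]
print(round(rows[0]["scikit"], 8))  # appears in one document
print(round(rows[0]["texto"], 8))   # appears in two documents, so lower weight
print(round(rows[2]["de"], 8))      # appears twice in the same document
```

"texto" receives a lower weight than "scikit" in the first document because it also appears in the second one, while "de" receives double the weight of its neighbours because it occurs twice in the third document.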

---
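You can avoid storing a vocabulary altogether: HashingVectorizer maps each token to a column index with a hash function, so the output always has a fixed number of columns (1,048,576 by default) regardless of the corpus.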

In [32]:
from sklearn.feature_extraction.text import HashingVectorizer

In [33]:
corpus = [
    "Scikit-learn nos ayuda a trabajar con texto",
    "Parte el texto en tokens, generalmente palabras completas",
    "Cuenta las ocurrencias de cada uno de estos tokens"
]

In [34]:
hashing_vectorizer = HashingVectorizer()
# HashingVectorizer is stateless, so fit() learns nothing; it is kept for API consistency
hashing_vectorizer.fit(corpus)

In [35]:
transformed_corpus = hashing_vectorizer.transform(corpus)

In [36]:
transformed_corpus

<3x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [37]:
transformed_corpus.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
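
The idea behind the hashing trick can be sketched with the standard library. This is a simplified illustration, not what scikit-learn does internally: `zlib.crc32` is used here as a stand-in hash function (scikit-learn uses MurmurHash3 and, by default, also applies a sign and normalizes each row), and the helper name `hashed_counts` and the tiny `n_features=16` are assumptions chosen to keep the output readable.

```python
import re
import zlib


def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size vector without building a vocabulary."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # The column index comes directly from the hash of the token
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


vec = hashed_counts("Cuenta las ocurrencias de cada uno de estos tokens")
print(len(vec))  # always n_features, whatever the text
print(sum(vec))  # total number of tokens counted
```

Because no token-to-column mapping is stored, memory use does not grow with the vocabulary, but two different tokens can land in the same column and there is no way to map a column back to its token.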