# Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values

Let's first use an example from earlier and apply the text processing steps we saw in this lesson.

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [2]:
corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

In [3]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

Use the skills you learned so far to create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk`

Feel free to refer back to previous sections to complete these steps!

In [8]:
def tokenize(text):
    # Todo: normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Todo: tokenize text
    tokens = word_tokenize(text)
    
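    # Lemmatize each token with the shared lemmatizer defined above.
    # Note: stop words are not filtered in this step, so tokens such as
    # 'the' and 'it' remain in the output and in the vocabulary below.
    tokens = [lemmatizer.lemmatize(w) for w in tokens]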

    return tokens

In [11]:
# test the tokenize function
print(tokenize(corpus[0]))

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring']
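
The output above still contains stop words such as "the", "you", and "it". As a minimal sketch of the stop word removal step, the filter could look like the following. The `STOP_WORDS` set here is a small, hand-picked stand-in (an assumption for illustration), not the full list returned by `stopwords.words("english")`, which would likely remove more tokens.

```python
# Tokens as printed above for corpus[0]
tokens = ['the', 'first', 'time', 'you', 'see', 'the', 'second',
          'renaissance', 'it', 'may', 'look', 'boring']

# Hypothetical mini stop word set, standing in for stopwords.words("english")
STOP_WORDS = {"the", "you", "it"}

# Keep only tokens that are not in the stop word set, preserving order
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters once the corpus grows.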


# `CountVectorizer` (Bag of Words)

In this section, you will count and vectorize the tokenized words from above.

Use the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn to convert the corpus to a matrix of token counts.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

# Todo: initialize count vectorizer object and pass the tokenize function to the `tokenizer` parameter
vect = CountVectorizer(tokenizer=tokenize)

In [28]:
# Todo: get counts of each token (word) in text data (corpus) using the fit_transform method
X = vect.fit_transform(corpus)

In [29]:
# convert sparse matrix to numpy array to view the counts of each token (word)
# each row is one line in the text (corpus) and the number is the count of a token
X.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
        1, 1, 1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        0, 0, 0, 1, 3, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [30]:
# view token vocabulary in the format of {token: feature indices}
vect.vocabulary_

{'the': 26,
 'first': 10,
 'time': 28,
 'you': 35,
 'see': 24,
 'second': 23,
 'renaissance': 22,
 'it': 13,
 'may': 17,
 'look': 15,
 'boring': 7,
 'at': 5,
 'least': 14,
 'twice': 29,
 'and': 3,
 'definitely': 9,
 'watch': 32,
 'part': 20,
 '2': 0,
 'will': 34,
 'change': 8,
 'your': 36,
 'view': 30,
 'of': 18,
 'matrix': 16,
 'are': 4,
 'human': 11,
 'people': 21,
 'one': 19,
 'who': 33,
 'started': 25,
 'war': 31,
 'is': 12,
 'ai': 2,
 'a': 1,
 'bad': 6,
 'thing': 27}

If you did it right, the count for "look" (feature index 15) is 1 in both the first and the second row, because the token "look" appears once in each of the first two lines of the corpus.
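
To make the structure of the count matrix concrete, here is a rough plain-Python sketch of a bag of words. It assumes the token lists that `tokenize` would produce for the corpus (hardcoded here to match the vocabulary shown above) and assigns feature indices by sorting the tokens alphabetically, which reproduces the indices in the vocabulary above.

```python
from collections import Counter

# Assumed output of tokenize() for each line of the corpus
docs = [
    ["the", "first", "time", "you", "see", "the", "second",
     "renaissance", "it", "may", "look", "boring"],
    ["look", "at", "it", "at", "least", "twice", "and",
     "definitely", "watch", "part", "2"],
    ["it", "will", "change", "your", "view", "of", "the", "matrix"],
    ["are", "the", "human", "people", "the", "one", "who",
     "started", "the", "war"],
    ["is", "ai", "a", "bad", "thing"],
]

# Map each distinct token to a column index, in sorted order
vocabulary = {token: i for i, token in
              enumerate(sorted({t for doc in docs for t in doc}))}

# One row per document, one column per token, filled with counts
matrix = [[0] * len(vocabulary) for _ in docs]
for row, doc in enumerate(docs):
    for token, count in Counter(doc).items():
        matrix[row][vocabulary[token]] = count

col = vocabulary["look"]
print(col, matrix[0][col], matrix[1][col])
```

The printed column index and the two counts for "look" can be compared directly against the vocabulary and array above.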

# `TfidfTransformer`

In this section, you will use the [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) in scikit-learn to transform the count matrix above to a normalized TF-IDF representation.
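
As a rough sketch of the arithmetic behind this transformation, the block below recomputes TF-IDF values in plain Python. It assumes the unsmoothed formula scikit-learn documents for `smooth_idf=False`, namely idf = ln(n / df) + 1, followed by L2 normalization of each row. The token lists are hardcoded to match the vocabulary shown earlier, and the helper name `tfidf_row` is made up for this illustration.

```python
import math
from collections import Counter

# Assumed output of tokenize() for each line of the corpus
docs = [
    ["the", "first", "time", "you", "see", "the", "second",
     "renaissance", "it", "may", "look", "boring"],
    ["look", "at", "it", "at", "least", "twice", "and",
     "definitely", "watch", "part", "2"],
    ["it", "will", "change", "your", "view", "of", "the", "matrix"],
    ["are", "the", "human", "people", "the", "one", "who",
     "started", "the", "war"],
    ["is", "ai", "a", "bad", "thing"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf_row(doc):
    """Return {token: tf-idf} for one document (unsmoothed idf, L2 norm)."""
    counts = Counter(doc)
    weights = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(docs[0])
print(round(row0["boring"], 4), round(row0["the"], 4))
```

Under these assumptions, "boring" (which appears in only one document) gets a higher weight than "it" (which appears in three), even though each occurs once in the first line.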

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer

# Todo: initialize tf-idf transformer object. Set the smooth_idf parameter to False.
transformer = TfidfTransformer(smooth_idf=False)

In [32]:
# Todo: use counts from count vectorizer results to compute tf-idf values using the fit_transform method
tfidf = transformer.fit_transform(X)

In [33]:
# convert sparse matrix to numpy array to view
# you can see that the counts are normalized
tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.31287579,  0.        ,  0.        ,
         0.31287579,  0.        ,  0.        ,  0.18115041,  0.        ,
         0.22976633,  0.        ,  0.31287579,  0.        ,  0.        ,
         0.        ,  0.        ,  0.31287579,  0.31287579,  0.31287579,
         0.        ,  0.36230083,  0.        ,  0.31287579,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.31287579,  0.        ],
       [ 0.29019634,  0.        ,  0.        ,  0.29019634,  0.        ,
         0.58039269,  0.        ,  0.        ,  0.        ,  0.29019634,
         0.        ,  0.        ,  0.        ,  0.16801935,  0.29019634,
         0.21311125,  0.        ,  0.        ,  0.        ,  0.        ,
         0.29019634,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.29019634,
         0.     

# `TfidfVectorizer`
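In this section, we will show you how to use a `TfidfVectorizer` object, which does the work of `CountVectorizer` and `TfidfTransformer` in one step. Note that the vectorizer below is created with default settings: it uses scikit-learn's built-in tokenizer rather than `tokenize`, and `smooth_idf` defaults to `True`, so the resulting values differ from those in the previous section.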

`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`
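
The sketch below approximates in plain Python what a default `TfidfVectorizer` does, to show where its numbers come from. It rests on two assumptions about the defaults: tokens are lowercase words of two or more characters, and the idf is smoothed as ln((1 + n) / (1 + df)) + 1 before L2 normalization. The helper name `smoothed_tfidf` is hypothetical.

```python
import math
import re
from collections import Counter

corpus = ["The first time you see The Second Renaissance it may look boring.",
          "Look at it at least twice and definitely watch part 2.",
          "It will change your view of the matrix.",
          "Are the human people the ones who started the war?",
          "Is AI a bad thing ?"]

# Assumed default tokenization: lowercase, words of 2+ characters
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]

n_docs = len(docs)
# Document frequency: number of documents that contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))
vocabulary = sorted(doc_freq)

def smoothed_tfidf(doc):
    """Return {token: tf-idf} for one document (smoothed idf, L2 norm)."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = smoothed_tfidf(docs[0])
print(len(vocabulary))
print(round(row0["first"], 4), round(row0["the"], 4))
```

Because single-character tokens such as "a" and "2" are dropped under this assumed pattern, the vocabulary is smaller than the one built with `tokenize`, which is why the rows of the array below are shorter than those in the earlier sections.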

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

In [35]:
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)

In [36]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.30298183,  0.        ,  0.        ,  0.30298183,  0.        ,
         0.        ,  0.20291046,  0.        ,  0.24444384,  0.        ,
         0.30298183,  0.        ,  0.        ,  0.        ,  0.        ,
         0.30298183,  0.30298183,  0.30298183,  0.        ,  0.40582093,
         0.        ,  0.30298183,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.30298183,  0.        ],
       [ 0.        ,  0.30015782,  0.        ,  0.60031564,  0.        ,
         0.        ,  0.        ,  0.30015782,  0.        ,  0.        ,
         0.        ,  0.20101919,  0.30015782,  0.24216544,  0.        ,
         0.        ,  0.        ,  0.        ,  0.30015782,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.30015782,  0.        ,  0.        ,
         0.30015782,  0.        ,  0.        ,  0.