# Text Mining of BBC News Data

## Part 2: Bag-of-Words Text Vectorization


## The Analyzer Object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer

In [None]:
test_sentence = "C'est l'été au Brésil!"

In [None]:
word_analyzer = CountVectorizer().build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(strip_accents="unicode", ngram_range=(2, 2)).build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(strip_accents="unicode", ngram_range=(1, 2)).build_analyzer()
word_analyzer(test_sentence)

In [None]:
word_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
word_analyzer(test_sentence)

## Analyzer = Preprocessor + Tokenizer (+ Token Filtering) (+ n-grams extraction)

In [None]:
vectorizer = CountVectorizer(strip_accents="unicode", lowercase=True)

In [None]:
test_sentence = "C'est l'été au Brésil!"

### Exercises

- Type `vectorizer.build_<TAB>` to see the list of methods of the vectorizer object;

- Use the vectorizer to build a preprocessor object and apply it to the test sentence: which transformations do you notice?

- Use the vectorizer to build a tokenizer object and apply it to the preprocessed test sentence from the previous step;

- Compare the results of the previous two steps with the output of the analyzer applied to the original test sentence.

In [None]:
# %load notebook_solutions/build_preprocessor.py

In [None]:
# %load notebook_solutions/build_tokenizer.py

## Vectorization of a Full Dataset

In [None]:
from pathlib import Path


bbc_folder_path = Path("bbc")
text_filepaths = sorted(bbc_folder_path.glob("*/*.txt"))

Instead of decoding the text manually as we did before, we will pass the file paths directly to the vectorizer and let it decode the files using the encoding of our choice, ignoring decoding errors (as we did previously):

In [None]:
%%time

vectorizer = CountVectorizer(encoding="utf-8", input="filename",
                             decode_error="ignore")

vectorizer.fit(text_filepaths)
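
To see what `input="filename"` and `decode_error="ignore"` do in isolation, here is a minimal sketch on two throwaway files. The file names, their contents and the `toy_` variables are made up for illustration and are not part of the BBC dataset; the second file deliberately contains a byte that is not valid UTF-8.

```python
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)
    # Hypothetical toy files, written as raw bytes.
    (tmp_path / "a.txt").write_bytes("Café prices rise".encode("utf-8"))
    # 0xff is not valid UTF-8: decode_error="ignore" silently drops it.
    (tmp_path / "b.txt").write_bytes(b"Prices fall \xff again")
    toy_filepaths = sorted(tmp_path.glob("*.txt"))

    # The vectorizer opens and decodes each file itself.
    toy_vectorizer = CountVectorizer(encoding="utf-8", input="filename",
                                     decode_error="ignore")
    toy_vectorizer.fit(toy_filepaths)

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
# → ['again', 'café', 'fall', 'prices', 'rise']
```

With `decode_error="strict"` (the default), fitting on the second file would raise a `UnicodeDecodeError` instead.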

In [None]:
type(vectorizer.vocabulary_)

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vocabulary_items = sorted(vectorizer.vocabulary_.items())

In [None]:
vocabulary_items[:10]

In [None]:
vocabulary_items[10000:10010]

In [None]:
vocabulary_items[-10:]
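
The `vocabulary_` attribute maps each token to a column index of the count matrix. The sketch below illustrates this on a two-sentence toy corpus (the sentences and the `toy_` names are assumptions made for this example, not BBC data), using `fit_transform` to get the counts in one step.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, small enough to check by hand.
toy_corpus = ["the cat sat", "the cat saw the dog"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# Each token is mapped to the index of its column in the count matrix.
toy_vocabulary_items = sorted(toy_vectorizer.vocabulary_.items())
print(toy_vocabulary_items)
# → [('cat', 0), ('dog', 1), ('sat', 2), ('saw', 3), ('the', 4)]

# One row per document, one column per token of the vocabulary.
print(toy_counts.toarray())
# → [[1 0 1 0 1]
#    [1 1 0 1 2]]
```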

**Questions**:

- Why do you think the result of the text vectorization procedure is called a "Bag-of-Words" representation?

- What is the main limitation of the "Bag-of-Words" representation?

- Can you come up with a pair of sentences with completely different meanings that would nevertheless share the same Bag-of-Words vector?

## TF-IDF Normalization

- TF-IDF stands for **"Term Frequency" - "Inverse Document Frequency"**.

- **Term Frequency** is the number of times a term or token appears in a given document;

- **Document Frequency** is the number of documents of the corpus in which a term or token appears at least once.


**Note**: depending on the context, frequencies can be either:

- **absolute** values (**integer counts**) or
- **relative** values (**floating point numbers between 0 and 1**, i.e. the **ratio** between two quantities).
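
As a small worked illustration of these definitions, the sketch below computes absolute and relative frequencies by hand on a three-document toy corpus. It uses only the standard library and a naive whitespace tokenizer, and it takes document frequency to mean the number of documents containing a term (the convention used by scikit-learn). The corpus and all variable names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized naively on whitespace.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "a bird sang",
]
tokenized = [doc.split() for doc in toy_corpus]

# Term frequency in the first document: absolute counts...
tf_absolute = Counter(tokenized[0])
# ...and relative values (count divided by the document length).
tf_relative = {token: count / len(tokenized[0])
               for token, count in tf_absolute.items()}

# Document frequency: number of documents containing each token.
# set(tokens) ensures a token is counted at most once per document.
df_absolute = Counter(token for tokens in tokenized for token in set(tokens))
# Relative document frequency: fraction of documents containing the token.
df_relative = {token: count / len(tokenized)
               for token, count in df_absolute.items()}

print(tf_absolute["the"], round(tf_relative["the"], 3))  # → 2 0.333
print(df_absolute["the"], round(df_relative["the"], 3))  # → 2 0.667
print(df_absolute["bird"])                               # → 1
```

Note that "the" appears three times in the corpus overall, but its document frequency is 2 because it occurs in only two of the three documents.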