# From text to matrices

## Documents as vectors
Unfortunately, computers find it hard to read texts. They like numbers more. We can't just feed them the raw tokens; we first have to transform each sentence into a **vector**.

A vector is just a list of numbers, such as [0, 10, 1, 15]. 

How to convert a text to a series of numbers is much debated. Below we show you the easiest and most common scenario: the **bag-of-words** approach.

This approach assumes that a document can be adequately represented by simply counting the words it contains. We represent the document numerically by collecting its **token frequencies**. For example, the code below converts a sentence to a vector of term frequencies:


In [None]:
from collections import Counter
fw = Counter(preprocess(sentence).split())
print(fw)

In [None]:
print(list(fw.values()))

In [None]:
We can vectorize all documents, and construct a **document-term matrix**. A matrix is nothing more than a collection of individual vectors, stacked as rows on top of each other. 

Imagine our corpus consists of just two sentences: "I like food" and "Cats like like food".

Using the bag-of-words approach we can convert this corpus to the following document-term matrix.

In [None]:
pd.DataFrame([[1,0,1,1],[0,1,1,2]],
              columns=["i","cats","food","like"], 
              index=['i like food','cats like like food'])
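The matrix above was typed in by hand. As a minimal sketch of how such counts could be derived, assuming tokens are separated by whitespace and the columns are ordered alphabetically (an arbitrary choice made here, so the column order differs from the hand-written table), we can combine `Counter` with a shared vocabulary. The names `corpus`, `counts`, `vocabulary` and `rows` are illustrative only.

```python
from collections import Counter

# Toy corpus from the example above, lowercased for simplicity.
corpus = ["i like food", "cats like like food"]

# Count the tokens of each document separately.
counts = [Counter(doc.split()) for doc in corpus]

# The vocabulary fixes the column order; alphabetical order is an assumption.
vocabulary = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary term (0 if the term is absent).
rows = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for doc, row in zip(corpus, rows):
    print(row, doc)
```

The important point is that every document is counted against the same vocabulary, so that a given column always refers to the same word.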


In [None]:
  We can do the same for the sentences we stored in the `SentenceProcessed` column. And the good news is that you don't have to write much of the code, because `sklearn` has provided you with many tools that simplify this task a lot.

  The cells below show how to vectorize your documents and generate a document-term matrix from your corpus. 

  The `CountVectorizer` class will convert each document to a vector of its token frequencies, just as in the previous example. Load the `CountVectorizer` by running the cell below.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# inspect the documentation
?CountVectorizer

In [None]:
As you noticed, the `CountVectorizer` has many arguments. Later on, you can adjust them and see how changing these settings improves or harms the performance of the classifier.

We suggest having a closer look at:
- `min_df` and `max_df`: discard words based on their document frequency. Words that occur only once or twice probably won't be important for predicting the label of a document. Discarding more frequent words is trickier and depends on the task at hand (sometimes function words convey important information!)
- `ngram_range`: n-grams are chunks of n consecutive words. The bag-of-words approach largely ignores the order in which words appear. However, we can retain some information on word order by also counting bigrams (or trigrams). For example, a bigram model will contain the phrase "not sad", whereas a unigram model won't capture this negation (it counts "not" and "sad" separately).
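To make these two settings more concrete, here is a rough sketch in plain Python of what n-gram extraction and document-frequency filtering amount to. It is not how `sklearn` implements them; the helper functions `ngrams` and `features`, the toy sentences and the thresholds are all made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text, ngram_range=(1, 2)):
    """Collect the n-grams for every n in the inclusive range."""
    tokens = text.split()
    low, high = ngram_range
    return [g for n in range(low, high + 1) for g in ngrams(tokens, n)]

# Hypothetical toy corpus, not taken from the dataset used in this notebook.
docs = ["i am not sad", "i am sad", "i am not happy"]
doc_features = [features(d) for d in docs]

# Document frequency: in how many documents does each feature occur?
doc_freq = Counter(f for feats in doc_features for f in set(feats))

# Keep features that occur in at least min_df documents (absolute count)
# and in at most a max_df proportion of all documents.
min_df, max_df = 2, 0.9
kept = sorted(f for f, freq in doc_freq.items()
              if min_df <= freq <= max_df * len(docs))
print(kept)
```

Features that occur in every document ("i", "am", "i am") are dropped by `max_df`, and features that occur only once (including "not sad" in this tiny corpus) are dropped by `min_df`.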

In [None]:
The code below converts all our processed documents into a document-term matrix (more specifically, a sparse matrix, which stores only the nonzero counts).

We first create `vectorizer`, an instance of the `CountVectorizer` class, for which we specify several of the arguments.

In [None]:
vectorizer = CountVectorizer(min_df=5, 
                             max_df=0.9,
                             ngram_range=(1,2),
                             token_pattern=r"\S+")

In [None]:
What about the `token_pattern` argument, you might wonder? Well, since we already tokenized the data, whitespace effectively indicates word boundaries. A token is everything between two whitespace characters (or between whitespace and the start or end of the sentence). This pattern is matched by the regular expression `\S+` (sequences of anything except whitespace).

You can check it for yourself, running the code below:

In [None]:
import re
pattern = re.compile(r"\S+")
print(pattern.findall(df.iloc[0].SentenceProcessed)[:10])
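To see why overriding the pattern matters, the sketch below compares `\S+` with the default `token_pattern` of `CountVectorizer`, which only matches runs of two or more word characters. The example sentence is hypothetical and merely imitates the lemma_part-of-speech format.

```python
import re

# Hypothetical pre-tokenized sentence in lemma_POS format.
text = "i_PRON do_AUX n't_PART like_VERB it_PRON ._PUNCT"

# Default token_pattern of CountVectorizer: two or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Pattern used in this notebook: any run of non-whitespace characters.
whitespace_pattern = re.compile(r"\S+")

default_tokens = default_pattern.findall(text)
whitespace_tokens = whitespace_pattern.findall(text)

print(default_tokens)
print(whitespace_tokens)
```

The default pattern cuts tokens at punctuation characters, so items such as `n't_PART` and `._PUNCT` are damaged, whereas `\S+` returns the tokens exactly as they were written.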

In [None]:
We can convert, as an example, the first hundred sentences using the `.fit_transform()` method. 

In [None]:
dtm = vectorizer.fit_transform(df.iloc[:100].SentenceProcessed)
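`.fit_transform()` combines two steps: `.fit()` learns the vocabulary from the documents, and `.transform()` counts tokens against that vocabulary. The sketch below separates the two on a toy corpus; `toy_vectorizer`, `train_docs` and the example sentences are assumptions made for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), standing in for the processed sentences.
train_docs = ["i like food", "cats like like food"]

toy_vectorizer = CountVectorizer(token_pattern=r"\S+")

# fit() learns the vocabulary; transform() counts tokens against it.
toy_vectorizer.fit(train_docs)
print(sorted(toy_vectorizer.vocabulary_))

# A new document is mapped onto the learned columns;
# "dogs" was never seen during fitting and is silently dropped.
new_vector = toy_vectorizer.transform(["dogs like food"]).toarray()
print(new_vector)
```

This distinction becomes relevant when you vectorize unseen documents: they should be passed to `.transform()` only, so that their columns line up with those of the matrix the vocabulary was learned from.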

In [None]:
Now, what does the `dtm` variable (an abbreviation for "document-term matrix") contain?

In [None]:
dtm.shape

In [None]:
The `.shape` attribute returns the dimensions of the matrix. It has 100 rows (because we selected the first 100 sentences) and 325 columns. 

Each column represents one feature. To inspect the features, call `.get_feature_names_out()` on the `CountVectorizer` (in scikit-learn versions before 1.0 this method was called `.get_feature_names()`).

You see that the number of features corresponds to the number of columns in `dtm`.

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
len(vectorizer.get_feature_names_out())

In [None]:
The features are n-grams (of length 1 and 2) consisting of lemma_part-of-speech pairs.

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out()[100:110])

In [None]:
To inspect a single document in vectorized form, we can select its row from `dtm`. The result is still a sparse matrix, which stores only the nonzero counts.

In [None]:
dtm[0]

In [None]:
The vector below is the numerical representation of the first sentence in our DataFrame, converted to a dense `numpy` array with `.toarray()`. These counts are what we feed to the training algorithm.

In [None]:
dtm[0].toarray()