<div class="container">
  <div class="jumbotron">
    <h1>Feature Extraction from text</h1>
    <p>Part One</p>
  </div>
</div>

This is part one, where we go over the basic theory of feature extraction.

- Most classic machine learning algorithms can't take in raw text.

- Instead, we need to perform some sort of feature "extraction" from the raw text in order to pass numerical features to the machine learning algorithm.

- For example, we could count the occurrence of each word to map text to an actual number.

- Let's discuss **Count Vectorization** along with **Term Frequency** and **Inverse Document Frequency**.
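To make the idea of mapping text to numbers concrete, here is a minimal sketch that counts word occurrences in a single message using only the standard library. The letters-only tokenizer is a simplifying assumption for illustration, not what Scikit-Learn does internally.

```python
import re
from collections import Counter

message = "Want to go walk your dogs?"

# Lowercase the text and keep only runs of letters as tokens (a simplifying assumption)
tokens = re.findall(r"[a-z]+", message.lower())

# Map each unique word to the number of times it occurs
counts = Counter(tokens)

print(counts["dogs"])  # 1
print(counts["cats"])  # 0, because a Counter returns 0 for missing keys
```

Each unique word now has a number attached to it, which is the kind of numerical feature a machine learning algorithm can work with.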

### Count Vectorization

- So **Count Vectorization** looks like this:

```python
messages = ["Hey, lets go to the game today!",
            "call your sister.",
            "Want to go walk your dogs?"]
```

Let's imagine we have a list of three messages. Right now everything is raw text (strings in Python). Using Scikit-Learn, we can perform **Count Vectorization** on them.

```python
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
```

Notice that the syntax looks very similar to the way we created machine learning models with Scikit-Learn. From `sklearn`, and specifically from its family of text feature extraction tools, we import `CountVectorizer`. It counts the occurrences of all the unique words.

After you create the count vectorizer, you vectorize the messages. It treats each unique word as a feature.

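```python
vect.fit(messages)  # learn the vocabulary from the messages first
vect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
```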
```
['call', 'dogs', 'game', 'go', 'hey', 'lets', 'sister', 'the',
 'to', 'today', 'walk', 'want', 'your']
```


### Document Term Matrix (DTM)

Then it counts the occurrences of each unique feature, or word, in every single document. Here each document is simply one text message.

![](../imgs/txt01.png)

You can think of the term "document" as just another word for each individual piece of text.

If we take a look here, notice that the word "call" didn't show up in the first message or the third message. It only showed up in the middle message. Likewise, the word "dogs" only shows up in the last message.

Some words do show up in more than one message, like the word "go" and the word "to".

But, overall, we should see a lot of zeros because most messages are not going to contain all the words.

So, you can imagine that for a very large set of documents, otherwise known as a **corpus**, we're going to have what's known as a very **sparse matrix**: a matrix made up mostly of zeros. This sort of matrix is known as the **Document Term Matrix** or **DTM**.

Again we're just counting the number of times each unique word, throughout the entire vocabulary of all the documents, shows up in each particular document.
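As a minimal sketch of how the matrix described above could be built, the snippet below calls `fit_transform()` on a `CountVectorizer` with default settings and inspects the result. The `sparsity` variable is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = ["Hey, lets go to the game today!",
            "call your sister.",
            "Want to go walk your dogs?"]

vect = CountVectorizer()

# fit_transform learns the vocabulary and returns the counts as a SciPy sparse matrix
dtm = vect.fit_transform(messages)

print(dtm.shape)      # (3, 13): 3 documents, 13 unique words
print(dtm.toarray())  # dense view, only practical for small examples

# Fraction of entries in the matrix that are zero
sparsity = 1 - dtm.nnz / (dtm.shape[0] * dtm.shape[1])
print(sparsity)
```

Each row is one message and each column is one word from the vocabulary, so the column for "call" is nonzero only in the second row.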

- An alternative to **Count Vectorization** is the `TfidfVectorizer`, or **Term Frequency-Inverse Document Frequency (TF-IDF)** vectorizer. It also creates a document term matrix from our messages.

- However, instead of filling in the **Document Term Matrix** or **DTM** with token counts, it calculates a term frequency-inverse document frequency (**TF-IDF**) value for each word.

So let's talk about what that actually means. What does TF-IDF mean?

- Let's first look at the first term, **tf(t,d)**, or **term frequency**. Term frequency is a function of a term and a particular document: it is simply the raw count of the term in that document, that is, the number of times the term **$t$** occurs in document **$d$**.

So, for example, let's take a look at the term frequency for the word **dogs**.

![](../imgs/txt01.png)

The term frequency for the first two messages would be zero. And then the term frequency for the word dogs in the last message would be one.
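The term frequency of "dogs" in each message can be checked with a short sketch. The `tf` helper and its letters-only tokenizer are hypothetical, written here only to mirror the definition of a raw count.

```python
import re

messages = ["Hey, lets go to the game today!",
            "call your sister.",
            "Want to go walk your dogs?"]

def tf(term, document):
    """Raw count of `term` in `document` (hypothetical helper, letters-only tokenizer)."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return tokens.count(term)

# Term frequency of "dogs" in each of the three messages
dogs_tf = [tf("dogs", m) for m in messages]
print(dogs_tf)  # [0, 0, 1]
```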

- Again, the term frequency **tf(t,d)** is just the raw count of a term in a document, something that we just saw.

- However, term frequency alone isn't enough for a thorough feature analysis of the text.

- Consider very common terms, or **stop words**, such as "a" or "the". Because these terms are so common, term frequency alone tends to incorrectly emphasize documents that happen to use words like "the" more often, without giving enough weight to more meaningful, distinctive terms such as "red" and "dogs".

- To correct for this, an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

- It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.


- Intuitively: **TF-IDF = term frequency * (1 / document frequency)**, where the inverse is logarithmically scaled in practice.
<br>
- **TF-IDF = term frequency * inverse document frequency**

$$tfidf(t,d,D)=tf(t,d) \cdot idf(t,D)$$
<br>
$$idf(t,D)=\log\frac{N}{\left|\lbrace d \in D : t \in d \rbrace\right|}$$
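The two formulas can be worked through by hand with a small sketch. The helpers `tokenize`, `tf`, `idf`, and `tfidf` are hypothetical names, the natural logarithm is an assumption (the formula does not fix a base), and the tokenizer keeps only runs of letters.

```python
import math
import re

messages = ["Hey, lets go to the game today!",
            "call your sister.",
            "Want to go walk your dogs?"]

def tokenize(document):
    # Letters-only tokenizer (a simplifying assumption)
    return re.findall(r"[a-z]+", document.lower())

def tf(term, document):
    # Raw count of the term in one document
    return tokenize(document).count(term)

def idf(term, documents):
    # log(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for d in documents if term in tokenize(d))
    return math.log(len(documents) / n_containing)

def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)

print(idf("dogs", messages))  # log(3/1): "dogs" appears in one message
print(idf("go", messages))    # log(3/2): "go" appears in two messages
print(tfidf("dogs", messages[2], messages))
```

A word that appears in fewer documents gets a larger IDF, so "dogs" ends up weighted more heavily than "go".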

- Fortunately, Scikit-Learn can calculate all of these terms for us through its API.

- Notice how similar the syntax is to our previous use of machine learning models in Scikit-Learn.

Here you can see that from `sklearn.feature_extraction.text` we're able to import the `TfidfVectorizer`.

In this case we call `fit_transform()` on the raw text messages, which performs term frequency-inverse document frequency vectorization on them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
dtm = vect.fit_transform(messages)
```

![](../imgs/txt02.png)

This takes into account not only the term frequency (how many times a term shows up in a single document) but also the inverse document frequency (how often the term shows up across all the documents).

That way, very common words aren't weighted as heavily as rarer, more distinctive words.
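One way to see this weighting is to inspect the IDF values the vectorizer learns. The sketch below assumes default settings. Scikit-Learn's defaults use a smoothed IDF and L2-normalize each row, so the numbers will not match the textbook formula exactly, but the ordering of the weights is the same. The name `idf_by_word` is introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

messages = ["Hey, lets go to the game today!",
            "call your sister.",
            "Want to go walk your dogs?"]

vect = TfidfVectorizer()
dtm = vect.fit_transform(messages)

# vocabulary_ maps each word to its column index; idf_ holds one learned weight per column
idf_by_word = {word: vect.idf_[idx] for word, idx in vect.vocabulary_.items()}

# "go" appears in two messages and "dogs" in only one, so "go" gets the lower weight
print(idf_by_word["go"] < idf_by_word["dogs"])  # True
```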


- TF-IDF allows us to understand the context of words across an entire corpus of documents instead of just their relative importance in a single document.

- Coming up next we're going to explore how to perform these operations with Python and Scikit-Learn.