# Feature Extraction from Text

This notebook is divided into two sections:
* First, we'll find out what is necessary to build an NLP system that can turn a body of text into a numerical array of *features* by **manually calculating frequencies and building out TF-IDF**.
* Next we'll show how to perform these steps **using scikit-learn tools**.
---------

+ [Part One: Core Concepts on Feature Extraction (MANUALLY)](#partone)
    + [1) Start with some documents:](#1)
    + [2) Building a vocabulary (Creating a "Bag of Words")](#2)
        + [2.1) Getting the unique words only](#2.1)
        + [2.2) Get all unique words across all documents (both One and Two)](#2.2)
        + [2.3) Create vocab dictionary with related index](#2.3)
    + [3) Bag of Words to Frequency Counts](#3)
        + [3.1) Make A list of All Vocab (which will be used to map later)](#3.1)
        + [3.2) Add in counts per word per doc:](#3.2)
        + [3.3) Create the DataFrame:](#3.3)
---------
+ [Concepts to Consider:](#concept)
    + [Bag of Words and Tf-idf](#bow)
    + [Stop Words and Word Stems](#stopwordswordstems)
    + [Tokenization and Tagging](#tokenizationtagging)
--------

+ [Part Two:  Feature Extraction with Scikit-Learn](#part2)
    + [1) CountVectorizer](#countvect)
    + [`stop_words` parameter](#stopwords)
    + [2) TfidfTransformer](#tfidf)
    + [3) Using Pipeline (combining two steps of CountVectorizer + Tfidf Transformer)](#pipeline)
    + [4) TfidfVectorizer (same as step 1 + step 2)](#tfidfvector)

-----

# <a name=partone>Part One: Core Concepts on Feature Extraction (MANUALLY)</a>

In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">This first section is for illustration only!
<br>Don't worry about memorizing this code - later on we will let Scikit-Learn Preprocessing tools do this for us.</div>

# <a name=1>1) Start with some documents:</a>
For simplicity we won't use any punctuation in the text files One.txt and Two.txt. Let's quickly open them and read them. Keep in mind that you should avoid reading and printing entire files if they are very large, since depending on how you open the file Python may load and display all of its contents at once.


In [1]:
with open('../Data/One.txt') as mytext:
    a = mytext.read()
    print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [3]:
# readlines returns as list
with open('../Data/One.txt') as mytext:
    a = mytext.readlines()
    print(a)

['This is a story about dogs\n', 'our canine pets\n', 'Dogs are furry animals\n']
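For larger files, one option is to iterate over the file object itself, which yields one line at a time instead of loading everything at once. The sketch below is self-contained: it writes a small temporary file with the same contents as One.txt (an assumption made only so the example runs anywhere) and counts the words on each line.

```python
import os
import tempfile

# Assumption: recreate the contents of One.txt in a temporary file
sample = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

line_lengths = []
with open(path) as mytext:
    for line in mytext:  # the file object yields one line at a time
        line_lengths.append(len(line.split()))

os.remove(path)  # clean up the temporary file
print(line_lengths)
```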


### Reading entire text as a string

In [8]:
with open('../Data/Two.txt') as mytext:
    entire_text = mytext.read()
    entire_text

In [10]:
print(entire_text)

This story is about surfing
Catching waves is fun
Surfing is a popular water sport



### Reading Each Line as a List

In [13]:
with open('../Data/One.txt') as mytext:
    lines = mytext.readlines()

In [14]:
lines

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

### Reading in Words Separately

In [16]:
with open('../Data/One.txt') as mytext:
    words = mytext.read().lower().split()

In [18]:
words

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

-----

# <a name=2>2) Building a vocabulary (Creating a "Bag of Words")</a>

Let's create dictionaries that correspond to **unique mappings of the words in the documents**. We can begin to think of this as mapping out all the possible words available for all (both) documents.

#### Read in One.txt

In [26]:
with open('../Data/One.txt') as mytext:
    words_one = mytext.read().lower().split()

In [27]:
words_one

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [28]:
len(words_one)

13

## <a name=2.1>2.1) Getting the unique words only</a>

In [29]:
unique_words_one = set(words_one)

In [30]:
unique_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [32]:
len(unique_words_one)

12

Document One now has only 12 unique words instead of its original 13, because `dogs` appears twice.

### Repeat for Two.txt

In [34]:
with open('../Data/Two.txt') as mytext:
    words_two = mytext.read().lower().split()
    unique_words_two = set(words_two)

In [35]:
len(words_two), len(unique_words_two)

(15, 12)

## <a name=2.2>2.2) Get all unique words across all documents (both One and Two)</a>

In [44]:
all_unique_words = set()

all_unique_words.update(unique_words_one)

In [45]:
print(all_unique_words)

{'our', 'canine', 'about', 'pets', 'a', 'this', 'are', 'furry', 'animals', 'dogs', 'is', 'story'}


In [46]:
all_unique_words.update(unique_words_two)

In [48]:
print(all_unique_words)

{'water', 'catching', 'waves', 'this', 'are', 'furry', 'animals', 'sport', 'is', 'story', 'our', 'canine', 'about', 'pets', 'a', 'popular', 'fun', 'dogs', 'surfing'}
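The two `update` calls above could also be written as a single union over any number of documents. A minimal sketch, assuming the documents are already available as strings (the `docs` list is a stand-in for the contents of One.txt and Two.txt):

```python
# Stand-in for the contents of One.txt and Two.txt
docs = [
    "This is a story about dogs our canine pets Dogs are furry animals",
    "This story is about surfing Catching waves is fun Surfing is a popular water sport",
]

# One set of lowercase words per document
unique_per_doc = [set(doc.lower().split()) for doc in docs]

# Union of all per-document sets gives the full vocabulary
vocabulary = set().union(*unique_per_doc)
print(len(vocabulary))
```

The same pattern works unchanged if more documents are appended to `docs`.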


## <a name=2.3>2.3) Create vocab dictionary with related index</a>

In [49]:
full_vocab = {}
i = 0

for word in all_unique_words:
    full_vocab[word] = i
    i +=1

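Take note that a set is not ordered, so the words will not come out in any particular order.
The for loop visits the set's elements in whatever order they are stored internally, not in alphabetical order, and for strings that order can change from one run to the next.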

In [50]:
full_vocab

{'water': 0,
 'catching': 1,
 'waves': 2,
 'this': 3,
 'are': 4,
 'furry': 5,
 'animals': 6,
 'sport': 7,
 'is': 8,
 'story': 9,
 'our': 10,
 'canine': 11,
 'about': 12,
 'pets': 13,
 'a': 14,
 'popular': 15,
 'fun': 16,
 'dogs': 17,
 'surfing': 18}
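Because set iteration order is not guaranteed, the indices above can differ between runs. If a reproducible mapping is wanted, one option is to sort the words before numbering them. A minimal sketch, with the vocabulary written out by hand so the example stands alone:

```python
# The 19 unique words from both documents, written out by hand
all_unique_words = {
    'water', 'catching', 'waves', 'this', 'are', 'furry', 'animals',
    'sport', 'is', 'story', 'our', 'canine', 'about', 'pets', 'a',
    'popular', 'fun', 'dogs', 'surfing',
}

# Sorting first makes the word -> index mapping deterministic
full_vocab = {word: index for index, word in enumerate(sorted(all_unique_words))}

print(full_vocab['a'], full_vocab['dogs'], full_vocab['waves'])
```

With alphabetical ordering, `a` always gets index 0 and `waves` always gets index 18.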

-----

# <a name=3>3) Bag of Words to Frequency Counts</a>

Now that we've encapsulated our "entire language" in a dictionary, let's perform *feature extraction* on each of our original documents:

#### Empty counts per doc

In [59]:
one_freq = [0] * len(full_vocab)
two_freq = [0] * len(full_vocab)
all_words = [''] * len(full_vocab)

In [60]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## <a name=3.1>3.1) Make A list of All Vocab (which will be used to map later)</a>

In [62]:
for word in full_vocab:
    word_index = full_vocab[word]
    all_words[word_index] = word

In [64]:
print(all_words)

['water', 'catching', 'waves', 'this', 'are', 'furry', 'animals', 'sport', 'is', 'story', 'our', 'canine', 'about', 'pets', 'a', 'popular', 'fun', 'dogs', 'surfing']


## <a name=3.2>3.2) Add in counts per word per doc:</a>

In [65]:
# map the frequencies of each word in One.txt to our vector:
with open('../Data/One.txt') as file:
    one_text  = file.read().lower().split()
    
for word in one_text:
    word_index = full_vocab[word] #get the index of that specific word
    one_freq[word_index] += 1 # increase by one

In [66]:
one_freq

[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0]

In [67]:
# Do the same for the second document:
with open('../Data/Two.txt') as file:
    two_text = file.read().lower().split()
    
for word in two_text:
    word_index = full_vocab[word]
    two_freq[word_index] += 1

In [68]:
two_freq

[1, 1, 1, 1, 0, 0, 0, 1, 3, 1, 0, 0, 1, 0, 1, 1, 1, 0, 2]
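The counting loops above can also be expressed with `collections.Counter` from the standard library. This is only an alternative sketch, not the approach the notebook relies on; the word lists are typed in directly and the vocabulary is sorted, so the column order differs from the output above.

```python
from collections import Counter

# Lowercased words of each document, typed in directly
one_words = "this is a story about dogs our canine pets dogs are furry animals".split()
two_words = ("this story is about surfing catching waves is fun "
             "surfing is a popular water sport").split()

# Sorted vocabulary shared by both documents
vocab = sorted(set(one_words) | set(two_words))

# Counter returns 0 for words it has never seen
one_counts = Counter(one_words)
two_counts = Counter(two_words)

one_freq = [one_counts[word] for word in vocab]
two_freq = [two_counts[word] for word in vocab]

print(one_freq)
print(two_freq)
```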

## <a name=3.3>3.3) Create the DataFrame:</a>

In [69]:
import pandas as pd

In [71]:
bow = pd.DataFrame(data=[one_freq, two_freq], columns=all_words)

In [72]:
bow

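| | water | catching | waves | this | are | furry | animals | sport | is | story | our | canine | about | pets | a | popular | fun | dogs | surfing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 2 |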


Now we can see how frequently each word appears in the documents.

By comparing the vectors we see that some words are common to both, some appear only in `One.txt`, others only in `Two.txt`. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Vectors would contain mostly zero values, making them **sparse matrices**.


# <a name=concept>Concepts to Consider:</a>

## <a name=bow>Bag of Words and Tf-idf</a>
In the above examples, each vector can be considered a *bag of words*. By themselves these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
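To make the two definitions concrete, here is a minimal sketch that applies them literally to our two documents: term frequency as count divided by document length, and inverse document frequency as the natural log of total documents over documents containing the word. The function names `tf`, `idf` and `tf_idf` are made up for this illustration, and library implementations typically use smoothed or normalized variants of these formulas.

```python
import math

# Lowercased words of each document, typed in directly
docs = {
    "one": "this is a story about dogs our canine pets dogs are furry animals".split(),
    "two": ("this story is about surfing catching waves is fun "
            "surfing is a popular water sport").split(),
}

def tf(word, words):
    # occurrences of the word divided by the document length
    return words.count(word) / len(words)

def idf(word, corpus):
    # log of (number of documents / documents containing the word);
    # assumes the word occurs in at least one document
    containing = sum(1 for words in corpus.values() if word in words)
    return math.log(len(corpus) / containing)

def tf_idf(word, doc_name, corpus):
    return tf(word, corpus[doc_name]) * idf(word, corpus)

print(tf_idf("dogs", "one", docs))  # appears only in document one
print(tf_idf("is", "one", docs))    # appears in both documents
```

A word such as `is` that occurs in every document gets an idf of `log(1) = 0`, so its tf-idf score is zero, while `dogs` keeps a positive score because it is specific to one document.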

## <a name=stopwordswordstems>Stop Words and Word Stems</a>

Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. This will shrink our vocab array and improve performance.
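As a rough illustration of both ideas, the sketch below drops a handful of stop words and strips a trailing "s" from longer words. Both the `STOP_WORDS` set and the `naive_stem` function are hypothetical toys written for this example; real stop word lists and stemmers are far more thorough.

```python
# Hypothetical, deliberately tiny stop word list
STOP_WORDS = {"a", "is", "are", "our", "this", "about"}

def naive_stem(word):
    # Toy rule: strip one trailing "s" from words longer than 3 characters
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

words = "this is a story about dogs our canine pets dogs are furry animals".split()

# Remove stop words, then reduce the remaining words to their toy "stems"
kept = [naive_stem(word) for word in words if word not in STOP_WORDS]

print(kept)
print(len(set(words)), "unique words before,", len(set(kept)), "after")
```

For Document One this shrinks the vocabulary from 12 unique words to 6.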

## <a name=tokenizationtagging>Tokenization and Tagging</a>
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.
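To see why splitting on whitespace is crude, consider text that does contain punctuation. The sketch below compares `.split()` with a simple regular expression that keeps only runs of two or more word characters; the `tokenize` helper and the sample sentence are made up for this illustration. The pattern is modeled on the default `token_pattern` of scikit-learn's `CountVectorizer`, which is why single-character words such as `a` are dropped.

```python
import re

# Runs of two or more word characters, bounded by word boundaries
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then pull out every match of the pattern
    return TOKEN_PATTERN.findall(text.lower())

sentence = "Dogs, our canine pets, are furry!"

naive_tokens = sentence.lower().split()  # punctuation stays attached
regex_tokens = tokenize(sentence)        # punctuation is discarded

print(naive_tokens)
print(regex_tokens)
print(tokenize("This is a line"))
```

With `.split()` the tokens `dogs,` and `furry!` would become separate vocabulary entries from `dogs` and `furry`; the regular expression avoids that.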

-------
-------

# <a name=part2>Part Two:  Feature Extraction with Scikit-Learn</a>

Let's explore the more realistic process of using sklearn to complete the tasks mentioned above!

# Scikit-Learn's Text Feature Extraction Options

In [73]:
text = [
    'This is a line',
    'This is another line',
    'Completely different line',
]

## <a name=countvect>1) CountVectorizer</a>

In [81]:
from sklearn.feature_extraction.text import CountVectorizer

In [103]:
cv = CountVectorizer()

+ `cv` will treat each string in the list as one document.
+ `fit_transform` first learns the unique vocabulary (fit) and then counts how often each word occurs in each document of the list (transform).
+ It returns a sparse matrix. The reason is that in a bag-of-words model most entries of the matrix are zeros. With hundreds or thousands of documents and a large vocabulary, storing all those zeros would eat up memory unnecessarily, so a sparse matrix stores only the non-zero entries.
+ The result here is a 3x6 sparse matrix: 3 rows because the list we passed contains 3 documents, and 6 columns because the vocabulary has 6 unique words. Calling the `todense()` method converts it to an ordinary dense matrix so we can read the frequency counts directly. **NOTE: avoid calling this method on a large corpus, since the dense form can take a lot of memory.**

In [104]:
sparse_matrix = cv.fit_transform(text)

In [105]:
sparse_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

#### using todense() to see in original form

In [106]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]], dtype=int64)
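The idea behind sparse storage can be sketched in plain Python by keeping only the non-zero cells of the dense matrix above. Note that this dictionary layout is just an illustration of the principle; it is not the Compressed Sparse Row format that scikit-learn actually returns.

```python
# Dense counts copied from the todense() output above
dense = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]

# Keep only the non-zero cells, keyed by (row, column)
stored = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(len(stored), "stored elements out of", total_cells, "cells")
```

The count of 10 stored elements agrees with the summary printed for `sparse_matrix`. On a corpus with thousands of documents the gap between stored elements and total cells becomes far larger.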

#### vocabulary_

In [89]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

Looking closely at the values, `another` is at index `0`. In the `todense()` output, column 0 has the value 1 only in the second row.

That makes sense because **This is another line** is the second document, the only one that contains the word `another`.
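To read the whole matrix this way, it helps to turn the vocabulary dictionary into a list of column names ordered by index. A minimal sketch using the values printed above, copied in by hand so the example runs without scikit-learn:

```python
# Copied from cv.vocabulary_ above
vocabulary = {'this': 5, 'is': 3, 'line': 4,
              'another': 0, 'completely': 1, 'different': 2}

# Copied from the todense() output above
counts = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]

# Order the words by their column index
feature_names = sorted(vocabulary, key=vocabulary.get)

# Pair every count with the word its column stands for
labeled = [dict(zip(feature_names, row)) for row in counts]

print(feature_names)
print(labeled[1])
```

The word `a` from the first document is missing from the vocabulary because `CountVectorizer` by default only keeps tokens of two or more characters.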

## <a name=stopwords>`stop_words` parameter</a>
+ With this parameter set to `'english'`, common English stop words are no longer part of the vocab.


In [92]:
cv = CountVectorizer(stop_words='english')

In [94]:
sparse_matrix = cv.fit_transform(text)

In [95]:
cv.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

-------

## <a name=tfidf>2) TfidfTransformer</a>

+ TfidfVectorizer is used on sentences, while TfidfTransformer is used on an existing count matrix, such as one returned by CountVectorizer
+ using this, **we can transform Bag of Words into TF-IDF**.

In [96]:
from sklearn.feature_extraction.text import TfidfTransformer

In [97]:
tfidf = TfidfTransformer()

In [107]:
sparse_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [108]:
results = tfidf.fit_transform(sparse_matrix) # BOW ===> TF-IDF

In [109]:
results

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [110]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
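These numbers differ from the simple textbook definition of tf-idf. The sketch below reproduces them in plain Python, assuming the default settings of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): the term frequency is the raw count, the idf is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit Euclidean length. The counts are copied by hand from the bag-of-words matrix above.

```python
import math

# Bag-of-words counts copied from the CountVectorizer output
counts = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as used when smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))  # L2 norm
    tfidf.append([value / norm for value in weighted])

for row in tfidf:
    print([round(value, 8) for value in row])
```

The word `line` occurs in all three documents, so it gets the lowest idf and therefore the smallest non-zero weight in every row.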

## <a name=pipeline>3) Using Pipeline (combining two steps of CountVectorizer + Tfidf Transformer)</a>

In [111]:
from sklearn.pipeline import Pipeline

In [112]:
pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

In [113]:
results = pipe.fit_transform(text)
results

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [114]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

----

## <a name=tfidfvector>4) TfidfVectorizer</a>

This does both of the steps above (1 and 2) in a single step!

In [116]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [117]:
tv = TfidfVectorizer()

In [119]:
tv_results = tv.fit_transform(text)

In [120]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

We can see that both methods yield the same results.