This document explains how to use the following text-representation techniques:

1. N-Grams
2. Bag of Words
3. Term Frequency-Inverse Document Frequency (TF-IDF)

# N-Grams
***
N-Grams are sequences of N consecutive words. Counting which N-Grams occur in a corpus lets us estimate which word is likely to come next in a piece of text.

Consider the sentence "FinTechExplained is a publication". 

1-Grams: FinTechExplained,is,a,publication

2-Grams: FinTechExplained is, is a, a publication

3-Grams: FinTechExplained is a, is a publication
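Before reaching for a library, the lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `make_ngrams` is hypothetical, and it splits on whitespace rather than using a real tokenizer.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered views of the token list:
    # tokens[0:], tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))


# Naive whitespace tokenization (an assumption; punctuation is not handled)
tokens = "FinTechExplained is a publication".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
```

Because `zip` stops at the shortest input, the function yields exactly the N-Grams that fit inside the sentence and nothing partial at the end.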

We can compute the N-grams using `NLTK`:

In [10]:
import nltk
from nltk.util import ngrams

# word_tokenize relies on NLTK's Punkt tokenizer data;
# download it once with nltk.download() if it is missing.

text = 'FinTechExplained is a publication'
tokenized_text = nltk.word_tokenize( text )


one_grams = ngrams( tokenized_text, 1 )
two_grams = ngrams( tokenized_text, 2 )
three_grams = ngrams( tokenized_text, 3 )

print("One grams:")
for one_gram in one_grams:
    print(one_gram)
print("\nTwo grams:")
for two_gram in two_grams:
    print(two_gram)
print("\nThree grams:")
for three_gram in three_grams:
    print(three_gram)

One grams:
('FinTechExplained',)
('is',)
('a',)
('publication',)

Two grams:
('FinTechExplained', 'is')
('is', 'a')
('a', 'publication')

Three grams:
('FinTechExplained', 'is', 'a')
('is', 'a', 'publication')
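To illustrate how N-Grams support next-word prediction, the sketch below counts 2-Grams in a tiny corpus and returns the most frequent follower of a given word. The corpus and the function name `predict_next` are hypothetical, chosen only for illustration; a real model would need far more text and some form of smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only
corpus = [
    "FinTechExplained is a publication",
    "FinTechExplained is a blog",
    "this is a publication",
]

# next_word_counts[w] counts the words that directly follow w
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current_word, next_word in zip(tokens, tokens[1:]):
        next_word_counts[current_word][next_word] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = next_word_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


print(predict_next("a"))
```

In this corpus "a" is followed by "publication" twice and by "blog" once, so the prediction for "a" is "publication".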


# Bag of Words
***
We need a numerical representation of our words before they can be fed into a machine learning algorithm.

One approach is to count the occurrences of each word in a document. The bag of words representation builds a matrix in which each row is a word and each column is a document. Each cell of the matrix holds the frequency of that term within that document.

This matrix is called the **Term Document Matrix**.

**Each row is a word vector. Each column is a document vector.**

For example, consider a set of tweets from Twitter and statuses from Facebook that contain the word "NLP".

We can tokenize the sentences into words and then populate the TDM with one column for Facebook and another column for Twitter.

The rows correspond to the words occurring in any of the tweets or statuses. Each value is the number of times that word occurs in the document for that column.
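The construction described above can be sketched by hand with the standard library. The two example texts are hypothetical stand-ins for real tweets and statuses, and tokenization is a simple lowercase whitespace split.

```python
from collections import Counter

# Hypothetical posts, for illustration only
documents = {
    "twitter": "NLP is fun and NLP is useful",
    "facebook": "learning NLP is fun",
}

# Count the words in each document (lowercased, whitespace-split)
counts = {name: Counter(text.lower().split()) for name, text in documents.items()}

# The vocabulary is every word seen in any document
vocabulary = sorted(set().union(*counts.values()))

# term_document_matrix[word][document] = number of occurrences
term_document_matrix = {
    word: {name: counts[name][word] for name in documents}
    for word in vocabulary
}

for word, row in term_document_matrix.items():
    print(f"{word:10s}", row)
```

A `Counter` returns 0 for words it has not seen, so words that appear in only one document still get a complete row.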

We will show how to populate this matrix using pandas and scikit-learn's `CountVectorizer`:

In [None]:
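import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Fill these lists with the raw text of the tweets and statuses.
# They must not be empty, or the vectorizer cannot build a vocabulary.
twitter_data = []
facebook_data = []

data = {'twitter': twitter_data,
        'facebook': facebook_data}

# Join each platform's posts into one document so the matrix has one column per platform
documents = [' '.join(data['twitter']), ' '.join(data['facebook'])]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

# fit_transform returns documents as rows, so transpose to get terms as rows.
# get_feature_names_out() requires scikit-learn 1.0 or later.
TDM = pd.DataFrame(counts.toarray().transpose(),
                   index=vectorizer.get_feature_names_out(),
                   columns=['twitter', 'facebook'])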

# Term Frequency-Inverse Document Frequency (TF-IDF)
***
TF-IDF is a statistical measure of how relevant a word is to a document within a corpus of documents.

For each term in each document, we compute a score by performing the following steps:
* TF (term frequency): calculate the number of times the term appears in the document, divided by the total number of terms in the document.
* IDF (inverse document frequency): compute the logarithm of the total number of documents divided by the number of documents containing the term.
* Multiply the TF by the IDF.

Collecting these scores gives a matrix whose rows represent terms and whose columns represent documents.
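The three steps can be written out directly in plain Python. This is a minimal sketch under a few assumptions: documents are already tokenized, the logarithm is the natural log, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Map each document name to a {term: tf-idf score} dictionary.

    `documents` maps a document name to its list of tokens.
    """
    n_documents = len(documents)

    # Number of documents that contain each term
    document_frequency = Counter()
    for tokens in documents.values():
        document_frequency.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        term_counts = Counter(tokens)
        scores[name] = {
            # TF * IDF, exactly as in the steps listed above
            term: (count / len(tokens))
            * math.log(n_documents / document_frequency[term])
            for term, count in term_counts.items()
        }
    return scores


# Hypothetical tokenized documents, for illustration only
docs = {
    "twitter": "nlp is fun".split(),
    "facebook": "nlp is hard work".split(),
}
scores = tf_idf(docs)
print(scores["twitter"])
```

A term that appears in every document, such as "nlp" here, gets an IDF of log(1) = 0 and therefore a TF-IDF of 0, which is how the measure discounts words that do not distinguish one document from another.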

## A Simple Example
***
Assume there are 100 documents of 100 words each, and that 4 of them contain the term "FinTechExplained": once in the first and second documents, twice in the third document, and three times in the fourth document.

In this case, the TF in each of the four documents will be:

Document 1 - 1/100

Document 2 - 1/100

Document 3 - 2/100

Document 4 - 3/100

For the inverse document frequency, because four of the 100 documents contain the word of interest, we get $\log(100/4)$ for the IDF.

Therefore, the TF-IDF of the term of interest in Document 1 is:
$(1/100)(\log(100/4))$
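To put numbers on the example, the short sketch below evaluates the TF-IDF for all four documents. It assumes the natural logarithm, since the text does not fix a base; a different base would only rescale every score by the same constant.

```python
import math

n_documents = 100
documents_with_term = 4
words_per_document = 100

# Occurrences of "FinTechExplained" in the four documents that contain it
term_counts = {"Document 1": 1, "Document 2": 1, "Document 3": 2, "Document 4": 3}

# Natural log is an assumption: idf = ln(100 / 4) = ln(25)
idf = math.log(n_documents / documents_with_term)

tf_idf = {
    name: (count / words_per_document) * idf
    for name, count in term_counts.items()
}

for name, score in tf_idf.items():
    print(f"{name}: {score:.4f}")
```

With the natural log this prints roughly 0.0322 for Documents 1 and 2, 0.0644 for Document 3 and 0.0966 for Document 4. The IDF is shared across the documents, so the scores differ only through the term frequency.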

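We can compute a closely related quantity using scikit-learn. By default, `TfidfVectorizer` smooths the IDF and L2-normalizes each document vector, so its values will not match the hand calculation above exactly: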

In [None]:
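import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `data` dictionary defined in the Bag of Words section
documents = [' '.join(data['twitter']), ' '.join(data['facebook'])]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# Transpose so that rows are terms and columns are documents.
# get_feature_names_out() requires scikit-learn 1.0 or later.
dataframe = pd.DataFrame(tfidf.toarray().transpose(),
                         index=vectorizer.get_feature_names_out(),
                         columns=['twitter', 'facebook'])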
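scikit-learn's numbers will not line up exactly with the simple example, because with its default settings `TfidfVectorizer` uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, uses raw counts for the TF, and then L2-normalizes each document vector. The sketch below compares the two IDF variants for the 100-document example; the function name `smoothed_idf` is hypothetical and the normalization step is left out.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF as used by scikit-learn's defaults (smooth_idf=True)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


# Textbook IDF from the simple example: ln(100 / 4)
plain_idf = math.log(100 / 4)

# Smoothed variant for the same counts: ln(101 / 5) + 1
sklearn_style_idf = smoothed_idf(100, 4)

print(f"plain idf:    {plain_idf:.4f}")
print(f"smoothed idf: {sklearn_style_idf:.4f}")
```

The added 1 inside the logarithm avoids division by zero for unseen terms, and the added 1 outside keeps terms that occur in every document from being zeroed out entirely.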