# TF-IDF (Term Frequency - Inverse Document Frequency)

A problem with scoring raw word frequency is that very frequent words start to dominate the document's representation, even though they may carry less ***informational content*** for the model than rarer, possibly domain-specific words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called ***Term Frequency – Inverse Document Frequency***, or TF-IDF for short, where:

 * Term Frequency: a score for how often the word appears in the current document.
 * Inverse Document Frequency: a score for how rare the word is across documents.
 * The resulting scores act as weights, since not all words are equally important or interesting.


Without going into the math, TF-IDF scores are word frequency scores that try to highlight the more interesting words, e.g. those that are frequent in a document but not across documents.
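
For readers who do want to see the arithmetic, the sketch below computes TF-IDF weights by hand for a tiny, made-up corpus. It assumes the smoothed variant that scikit-learn uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. The corpus and the helper name `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (scikit-learn's default form): ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[2])  # "the cat ran"
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "ran" occurs in one document, "cat" in two and "the" in all three, so the weights for the third document come out in that order, from highest to lowest.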

## Implementing a TF-IDF Model

### Load Libraries

In [15]:
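import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# First run only: fetch the NLTK resources used below (resource names as of NLTK 3.5).
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')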

In [16]:
english_text = """Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before."""

In [17]:
english_text

'Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.'

In [18]:
arabic_text =u"""ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة, الجبر کان نظرية موحدة تتيح الأعداد الكسرية والأعداد اللا كسرية, والمقادير الهندسية وغيرها, أن تتعامل على أنها أجسام جبرية, وأعطت الرياضيات ككل مسارا جديدا للتطور بمفهوم أوسع بكثير من الذي كان موجودا من قبل, وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر وهو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل"""

In [19]:
arabic_text

'ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة, الجبر کان نظرية موحدة تتيح الأعداد الكسرية والأعداد اللا كسرية, والمقادير الهندسية وغيرها, أن تتعامل على أنها أجسام جبرية, وأعطت الرياضيات ككل مسارا جديدا للتطور بمفهوم أوسع بكثير من الذي كان موجودا من قبل, وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر وهو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل'

### Data Cleaning

#### Text to sentences

In [27]:
english_sentences = nltk.sent_tokenize(english_text)
arabic_sentences = nltk.sent_tokenize(arabic_text)

In [21]:
print(len(english_sentences), 'English sentences')
print(len(arabic_sentences), 'Arabic sentences')

6 English sentences
2 Arabic sentences


In [22]:
english_sentences

['Perhaps one of the most significant advances in  made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra.',
 'It is important to understand just how significant this new idea was.',
 'It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry.',
 'Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects".',
 'It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject.',
 'Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.']

#### Clean English Text

In [32]:
lemmatizer = WordNetLemmatizer()
# build the stop-word set once instead of once per word
english_stopwords = set(stopwords.words("english"))
english_corpus = []

for sentence in english_sentences:
    # keep letters only; every other character becomes a space
    cleaning_text = re.sub('[^a-zA-Z]', ' ', sentence)
    # convert to lower case
    cleaning_text = cleaning_text.lower()
    # tokenize the sentence on whitespace
    tokens = cleaning_text.split()
    # drop stop words and lemmatize the remaining words
    sentence_lem = [lemmatizer.lemmatize(word) for word in tokens if word not in english_stopwords]
    english_corpus.append(' '.join(sentence_lem))

In [33]:
english_corpus

['perhaps one significant advance made arabic mathematics began time work al khwarizmi namely beginning algebra',
 'important understand significant new idea',
 'revolutionary move away greek concept mathematics essentially geometry',
 'algebra unifying theory allowedrational number irrational number geometrical magnitude etc treated algebraic object',
 'gave mathematics whole new development path much broader concept existed provided vehicle future development subject',
 'another important aspect introduction algebraic idea allowed mathematics applied itselfin way happened']

#### Clean Arabic Text

In [25]:
# WordNetLemmatizer only covers English and returns Arabic words unchanged,
# so this step only removes stop words.
arabic_stopwords = set(stopwords.words("arabic"))
arabic_corpus = []

for sentence in arabic_sentences:
    # tokenize the sentence on whitespace (punctuation stays attached to tokens)
    tokens = sentence.split()
    # drop Arabic stop words
    sentence_clean = [word for word in tokens if word not in arabic_stopwords]
    arabic_corpus.append(' '.join(sentence_clean))


## TF-IDF

### Create the transform

In [28]:
vectorizer = TfidfVectorizer()
vectorizer.fit(english_corpus)

### Summarize

In [37]:
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'perhaps': 42, 'one': 40, 'significant': 45, 'advance': 0, 'made': 31, 'arabic': 8, 'mathematics': 33, 'began': 11, 'time': 48, 'work': 55, 'al': 1, 'khwarizmi': 30, 'namely': 36, 'beginning': 12, 'algebra': 2, 'important': 26, 'understand': 50, 'new': 37, 'idea': 25, 'revolutionary': 44, 'move': 34, 'away': 10, 'greek': 23, 'concept': 14, 'essentially': 16, 'geometry': 22, 'unifying': 51, 'theory': 47, 'allowedrational': 5, 'number': 38, 'irrational': 28, 'geometrical': 21, 'magnitude': 32, 'etc': 17, 'treated': 49, 'algebraic': 3, 'object': 39, 'gave': 20, 'whole': 54, 'development': 15, 'path': 41, 'much': 35, 'broader': 13, 'existed': 18, 'provided': 43, 'vehicle': 52, 'future': 19, 'subject': 46, 'another': 6, 'aspect': 9, 'introduction': 27, 'allowed': 4, 'applied': 7, 'itselfin': 29, 'way': 53, 'happened': 24}
[2.25276297 2.25276297 1.84729786 1.84729786 2.25276297 2.25276297
 2.25276297 2.25276297 2.25276297 2.25276297 2.25276297 2.25276297
 2.25276297 2.25276297 1.84729786 2.
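
The IDF values printed above can be checked by hand. The sketch below assumes scikit-learn's default smoothing (`smooth_idf=True`), where `idf = ln((1 + n) / (1 + df)) + 1`, with `n = 6` sentences in `english_corpus`. The helper name `smooth_idf` is introduced here for illustration only.

```python
import math

N_DOCS = 6  # number of sentences in english_corpus

def smooth_idf(doc_freq, n_docs=N_DOCS):
    """IDF with scikit-learn's default smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# A term found in a single sentence, such as 'advance' (index 0).
print(round(smooth_idf(1), 8))
# A term found in two sentences, such as 'algebra' (index 2).
print(round(smooth_idf(2), 8))
```

The two results should agree with the `2.25276297` and `1.84729786` entries in the printed array: the rarer a term is across sentences, the larger its IDF.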

### Encode document

In [45]:
vectors = vectorizer.transform(english_corpus)
# summarize encoded vector
print(vectors.shape)
print(vectors.toarray())

(6, 56)
[[0.27020314 0.27020314 0.22157044 0.         0.         0.
  0.         0.         0.27020314 0.         0.         0.27020314
  0.27020314 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.27020314 0.27020314 0.         0.16030048 0.         0.
  0.27020314 0.         0.         0.         0.27020314 0.
  0.27020314 0.         0.         0.22157044 0.         0.
  0.27020314 0.         0.         0.         0.         0.
  0.         0.27020314]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42690011 0.42690011 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42690011 0.         0.         0.         0.
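
The first row of the matrix can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that the vectorizer runs with its defaults (smoothed IDF, `norm='l2'`); the document frequencies are read off `english_corpus`, where every term of the first sentence occurs exactly once.

```python
import math

def smooth_idf(doc_freq, n_docs=6):
    """IDF with scikit-learn's default smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Document frequencies of the 15 terms in the first cleaned sentence:
# 12 terms occur in that sentence only, 'significant' and 'algebra' occur
# in two sentences, and 'mathematics' occurs in four.
doc_freqs = [1] * 12 + [2, 2] + [4]

# With tf = 1 for every term, the raw weight is just the IDF.
raw = [smooth_idf(df) for df in doc_freqs]
norm = math.sqrt(sum(w * w for w in raw))  # L2 norm of the row

print(round(smooth_idf(1) / norm, 8))  # weight of e.g. 'advance'
print(round(smooth_idf(2) / norm, 8))  # weight of 'algebra' and 'significant'
print(round(smooth_idf(4) / norm, 8))  # weight of 'mathematics'
```

These correspond to the three distinct non-zero values in the first row above (`0.27020314`, `0.22157044` and `0.16030048`). 'mathematics' gets the lowest weight because it appears in the most sentences.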

In [51]:
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer.get_feature_names()

['advance',
 'al',
 'algebra',
 'algebraic',
 'allowed',
 'allowedrational',
 'another',
 'applied',
 'arabic',
 'aspect',
 'away',
 'began',
 'beginning',
 'broader',
 'concept',
 'development',
 'essentially',
 'etc',
 'existed',
 'future',
 'gave',
 'geometrical',
 'geometry',
 'greek',
 'happened',
 'idea',
 'important',
 'introduction',
 'irrational',
 'itselfin',
 'khwarizmi',
 'made',
 'magnitude',
 'mathematics',
 'move',
 'much',
 'namely',
 'new',
 'number',
 'object',
 'one',
 'path',
 'perhaps',
 'provided',
 'revolutionary',
 'significant',
 'subject',
 'theory',
 'time',
 'treated',
 'understand',
 'unifying',
 'vehicle',
 'way',
 'whole',
 'work']

In [40]:
import pandas as pd

In [53]:
# on scikit-learn >= 1.2, pass vectorizer.get_feature_names_out() as the column names instead
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())

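| | advance | al | algebra | algebraic | allowed | allowedrational | another | applied | arabic | aspect | ... | subject | theory | time | treated | understand | unifying | vehicle | way | whole | work |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.270203 | 0.270203 | 0.22157 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.270203 | 0.0 | ... | 0.0 | 0.0 | 0.270203 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.270203 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.520601 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.216508 | 0.216508 | 0.0 | 0.264029 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.264029 | 0.0 | 0.264029 | 0.0 | 0.264029 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.252403 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.252403 | 0.0 | 0.252403 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.254653 | 0.310547 | 0.0 | 0.310547 | 0.310547 | 0.0 | 0.310547 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.310547 | 0.0 | 0.0 |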


### Advantages:

- Easy to compute
- You have some basic metric to extract the most descriptive terms in a document
- You can easily compute the similarity between 2 documents using it
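
As a rough illustration of the last point, cosine similarity between two TF-IDF vectors needs only a dot product and two norms. The vectors below are hypothetical and not taken from the corpus above; scikit-learn also ships `sklearn.metrics.pairwise.cosine_similarity` for working on whole matrices.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows over a shared four-term vocabulary.
doc_a = [0.8, 0.6, 0.0, 0.0]
doc_b = [0.6, 0.0, 0.8, 0.0]
doc_c = [0.0, 0.0, 0.0, 1.0]

print(cosine_similarity(doc_a, doc_b))  # share one term, so partly similar
print(cosine_similarity(doc_a, doc_c))  # no shared terms, so similarity is 0.0
```

Because `TfidfVectorizer` L2-normalises each row by default, the cosine similarity of two of its rows reduces to their dot product.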

### Disadvantages:
- TF-IDF is based on the bag-of-words (BoW) model, so it does not capture word position, semantics, co-occurrences across documents, etc.
- For this reason, TF-IDF is only useful as a lexical-level feature.
- It cannot capture semantics the way topic models or word embeddings can.
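
The loss of word order is easy to demonstrate. In this minimal sketch, with two made-up sentences, a bag-of-words count cannot tell apart two sentences that use the same words in a different order. Any TF-IDF vector built on those counts inherits the same limitation.

```python
from collections import Counter

first = "dog bites man".split()
second = "man bites dog".split()

# A bag-of-words representation keeps only the count of each term.
print(Counter(first) == Counter(second))  # True: identical bags, different meaning
```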

### Printing Dependencies

In [56]:
%load_ext watermark

In [57]:
%watermark --iversions

nltk  : 3.5
re    : 2.2.1
pandas: 1.1.3

