# TEXT PREPROCESSING - TF- IDF

**TF-IDF stands for Term Frequency - Inverse Document Frequency**

This approach is more intelligent than just COUNT or BINARY ( as in Bag Of Words)

**Term Frequency** is a measure of term wrt the sentence. i.e (#term in sentence/ # total terms in sentence).
Reward words having high occurrence in a document - Frequent

**Inverse Document Frequency** is a measure of each term wrt to document.i.e. log of(# of sentences/ #sentences with the term)
Penalize words appearing many times in a document collection. Too general words should not have have high weight eg. "or" "not" "is" "the" - Reward Rarity

Both terms are multiplied to give us the tf-idf value of each term wrt the sentence.

Hence word rare in a document collection but frequent in a particular document get high weight.

In other word, combining TF and IDF together => assign high weight to discriminative words in a document

In [2]:
#Natural Language Tool Kit (NLTK) is a NLP package for text data processing
import nltk

paragraph = """Paragraphs are the building blocks of papers. Many students define paragraphs \
in terms of length. A paragraph is a group of at least five sentences. Paragraph \
is half a page long, etc."""

In [5]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

sentences = nltk.sent_tokenize(paragraph)
sentences

['Paragraphs are the building blocks of papers.',
 'Many students define paragraphs in terms of length.',
 'A paragraph is a group of at least five sentences.',
 'Paragraph is half a page long, etc.']

In [8]:
lemmatizer = WordNetLemmatizer()
lemmatized_sentences = []

for sentence in sentences:
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    words = nltk.word_tokenize(sentence)
    print("Words before lemmatization : ", words)
    
    lemmas = []
    for word in words:
        if word not in set(stopwords.words('english')):
            lemma = lemmatizer.lemmatize(word)
            lemma = lemma.lower()
            lemmas.append(lemma)
    
    lemmatized_sentence = ' '.join(lemmas)
    lemmatized_sentences.append(lemmatized_sentence)
        
    print("Words after lemmatizaton : ", lemmas)
    
print("Sentences after lemmatizaton : ", lemmatized_sentences)   

Words before lemmatization :  ['Paragraphs', 'are', 'the', 'building', 'blocks', 'of', 'papers']
Words after lemmatizaton :  ['paragraphs', 'building', 'block', 'paper']
Words before lemmatization :  ['Many', 'students', 'define', 'paragraphs', 'in', 'terms', 'of', 'length']
Words after lemmatizaton :  ['many', 'student', 'define', 'paragraph', 'term', 'length']
Words before lemmatization :  ['A', 'paragraph', 'is', 'a', 'group', 'of', 'at', 'least', 'five', 'sentences']
Words after lemmatizaton :  ['a', 'paragraph', 'group', 'least', 'five', 'sentence']
Words before lemmatization :  ['Paragraph', 'is', 'half', 'a', 'page', 'long', 'etc']
Words after lemmatizaton :  ['paragraph', 'half', 'page', 'long', 'etc']
Sentences after lemmatizaton :  ['paragraphs building block paper', 'many student define paragraph term length', 'a paragraph group least five sentence', 'paragraph half page long etc']


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer()
tf_idf_vector = tf_idf_vectorizer.fit_transform(lemmatized_sentences)

for count, word in zip(tf_idf_vector.toarray().tolist()[0], tf_idf_vectorizer.get_feature_names()) :
    print("word : {} count : {} ".format(word, count))

word : block count : 0.5 
word : building count : 0.5 
word : define count : 0.0 
word : etc count : 0.0 
word : five count : 0.0 
word : group count : 0.0 
word : half count : 0.0 
word : least count : 0.0 
word : length count : 0.0 
word : long count : 0.0 
word : many count : 0.0 
word : page count : 0.0 
word : paper count : 0.5 
word : paragraph count : 0.0 
word : paragraphs count : 0.5 
word : sentence count : 0.0 
word : student count : 0.0 
word : term count : 0.0 
