<a href="https://colab.research.google.com/github/manishiitg/ML_Experments/blob/master/nlp/101/tf_idf_experments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**TFIDF**

Term Frequency - Inverse Document Frequency 

Implement and try out TF-IDF using scikit-learn.

In [1]:
import spacy 
nlp = spacy.load("en_core_web_sm")

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
docs= [
  "NLP is awesome",
  "I love NLP"
]


# Named tfidf_vectorizer because this is a TfidfVectorizer, not a CountVectorizer.
tfidf_vectorizer = TfidfVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=None)

tfidf = tfidf_vectorizer.fit_transform(docs)

print(tfidf.todense())

[[0.81480247 0.         0.57973867]
 [0.         0.81480247 0.57973867]]


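To see what the output looks like, fit the vectorizer on a very simple two-document list and print the resulting matrix. Each row is a document and each column is a vocabulary term (here, in alphabetical order: awesome, love, nlp).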
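The numbers in the matrix above can be reproduced by hand, which helps show what the vectorizer computes. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The `counts` array is typed in manually from the two toy documents after stop-word removal; it is not read from the vectorizer.

```python
import numpy as np

# Raw term counts for the two toy documents after stop-word removal.
# Columns follow the alphabetical vocabulary: awesome, love, nlp.
counts = np.array([[1, 0, 1],
                   [0, 1, 1]], dtype=float)

n_docs = counts.shape[0]
# Document frequency: in how many documents each term occurs.
df = (counts > 0).sum(axis=0)

# Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then L2-normalize each row.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(manual_tfidf, 8))
```

"nlp" occurs in both documents, so its IDF is 1 and it ends up with the smaller weight (about 0.58) in each row, while the document-specific terms get about 0.81.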


In [7]:
from sklearn.feature_extraction.text import CountVectorizer

docs= [
  "NLP is awesome",
  "I love NLP"
]


count_vectorizer = CountVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=None)    

bag_of_words = count_vectorizer.fit_transform(docs)

print(bag_of_words.todense())


[[1 0 1]
 [0 1 1]]


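Comparing TF-IDF to bag of words: both matrices have the same non-zero pattern, but the bag-of-words matrix holds raw counts while the TF-IDF matrix holds weighted, normalized values.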
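A small side-by-side run can make the difference concrete. The sketch below uses a hypothetical three-document corpus (`toy_docs`, not part of the notebook's data) and the vectorizers' default tokenizers rather than `nltk.word_tokenize`, so it is only an illustration of the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "nlp" occurs in every document, "rocks" in only one.
toy_docs = ["nlp rocks", "nlp is fun", "nlp is hard"]

bow = CountVectorizer()
counts = bow.fit_transform(toy_docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs).toarray()

# Both vectorizers tokenize identically here, so they share column indices.
nlp_col = bow.vocabulary_["nlp"]
rocks_col = bow.vocabulary_["rocks"]

# In the first document both words occur once, so the counts are equal...
print("counts:", counts[0, nlp_col], counts[0, rocks_col])
# ...but TF-IDF gives the rarer word "rocks" a larger weight than "nlp".
print("tfidf: ", weights[0, nlp_col], weights[0, rocks_col])
```

Bag of words treats the two words in the first document as equally important, while TF-IDF down-weights the word that appears everywhere.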

In [3]:
import nltk
nltk.download('punkt')

import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="train")

print(news.keys())

df = pd.DataFrame(news['data'])
print(df.head())


tfidf_vectorizer = TfidfVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=100, max_df=.9)  

tfidf_vectorizer.fit(news["data"])

# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() (scikit-learn >= 1.0) returns an array, so convert it to a list.
print(tfidf_vectorizer.get_feature_names_out().tolist())
print(tfidf_vectorizer.vocabulary_)

Xtr = tfidf_vectorizer.transform(news["data"])

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
                                                   0
0  From: lerxst@wam.umd.edu (where's my thing)\nS...
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...
['!', '#', '$', '%', '&', "'", "''", "'ax", "'d", "'ll", "'m", "'re", "'s", "'ve", '*', '+', '-', '--', '...', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', ';', '<', '=', '>', '?', '[', ']', '`', '``', 'article', 'believe', 'better', 'c', 'ca', 'computer', 'did', 'distribution', 'does', 'file', 'g', 'god', 'going', 'good', 'government', 'help', 'information', 'just', 'know', 'like', 'm', 'make', 'max', 'n', "n't", 'need', 'new', 'nntp-posting-host', 'number', 'o', 'p', 'people', 'point', 'prob

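Applying TF-IDF to a much larger dataset: the 20 Newsgroups training set.

*max_features* limits the vocabulary: with `max_features=100`, only the 100 terms with the highest total term frequency across the corpus are kept in the final representation.

*max_df* is the maximum document-frequency threshold: with `max_df=.9`, terms that appear in more than 90% of the documents are ignored.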
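The effect of these two parameters is easier to see on a tiny corpus. The sketch below is an illustration only: `toy_docs` is a hypothetical corpus, the default tokenizer is used instead of `nltk.word_tokenize`, and `max_features=2` stands in for the 100 used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" is in every document, "cat" in three, "dog" in two.
toy_docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the cat and the dog slept",
    "the bird sang",
]

# No limits: every token of two or more characters becomes a feature.
unlimited = TfidfVectorizer()
unlimited.fit(toy_docs)
print(sorted(unlimited.vocabulary_))

# max_df=0.9 drops terms found in more than 90% of documents ("the"),
# then max_features=2 keeps the two most frequent remaining terms.
limited = TfidfVectorizer(max_df=0.9, max_features=2)
limited.fit(toy_docs)
print(sorted(limited.vocabulary_))
```

Here "the" is removed by `max_df` even though it is the most frequent token, and `max_features` then keeps "cat" and "dog" because they have the highest counts among the terms that remain.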

In [0]:

def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    # Build the frame with its column names in one step.
    return pd.DataFrame(top_feats, columns=['feature', 'tfidf'])

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)


def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
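    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    # Explicit check: `if grp_ids:` raises for NumPy arrays with more than one element.
    if grp_ids is not None and len(grp_ids) > 0: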
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

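These helper functions make TF-IDF representations easier to analyze: they list the top-weighted features of a single document, or the top features averaged over a group of documents.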
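The core of `top_tfidf_feats` is the `np.argsort(row)[::-1][:top_n]` idiom. A minimal sketch of how it picks the highest-scoring features, using made-up feature names and scores (`features` and `row` here are illustrative values, not output from the notebook):

```python
import numpy as np

# Made-up feature names and TF-IDF scores for a single document.
features = ["car", "door", "engine", "bumper"]
row = np.array([0.6, 0.1, 0.0, 0.3])

top_n = 2
# argsort returns indices in ascending score order; reverse them for
# descending order, then keep the first top_n indices.
topn_ids = np.argsort(row)[::-1][:top_n]

top = [(features[i], float(row[i])) for i in topn_ids]
print(top)  # [('car', 0.6), ('bumper', 0.3)]
```

`top_mean_feats` applies the same selection to the column-wise mean of the matrix after zeroing out scores below `min_tfidf`.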

In [0]:
# to clean the data 

def normalize(comment, lowercase=True, remove_stopwords=True):
    '''Lowercase the text, strip quotes and non-breaking spaces, drop punctuation
    (and optionally stop words), then lemmatize the remaining tokens with spaCy.'''
    if lowercase:
        comment = comment.lower()
    lines = comment.splitlines()
    lines = [x.strip(' ') for x in lines]
    # Remove escaped quotes before plain quotes; in the reverse order
    # the escaped-quote pattern could never match.
    lines = [x.replace('\\"', '') for x in lines]
    lines = [x.replace('"', '') for x in lines]
    lines = [x.replace(u'\xa0', u'') for x in lines]
    comment = " ".join(lines)
    doc = nlp(comment)

    # Keep tokens that are not punctuation and, if requested, not stop words.
    words = [token for token in doc
             if not token.is_punct
             and not (remove_stopwords and token.is_stop)]
    lemmatized = list()
    for word in words:
        lemma = word.lemma_.strip()
        if lemma:
            lemmatized.append(lemma)
    return " ".join(lemmatized)


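The `normalize` function cleans the data: it lowercases the text, removes stop words and punctuation, and lemmatizes each remaining token with spaCy.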

In [6]:

data_to_process = news["data"][:100]

clean_data  = []


print("cleaning data")
for row in data_to_process:
  clean_data.append(normalize(row))

print("data cleaned")

tfidf_vectorizer = TfidfVectorizer()  

tfidf_vectorizer.fit(clean_data)

Xtr = tfidf_vectorizer.transform(clean_data)

# get_feature_names() was removed in newer scikit-learn releases (use >= 1.0 for this call).
features = tfidf_vectorizer.get_feature_names_out()

df2 = top_feats_in_doc(Xtr, features, 0, 10)

print("data without cleaning")
print(data_to_process[0])

print("cleaned data")
print(clean_data[0])

print("top features of the first document")
print(df2)

print("")

print("top means features across all documents")
df3 = top_mean_feats(Xtr, features)

print(df3)

cleaning data
data cleaned
data without cleaning
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





cleaned data
lerxst@wam.umd.edu thing subject car nntp post host rac3.wam.umd.edu organization university maryland college park line 15 wonder enlighten car see day 2-door sport car look late 60s/ early 70 call bricklin door smal