
# Text Vectorization Using Traditional Methods

* There are many ways to vectorize a text into numeric representations.

* In traditional linguistic studies, linguists may manually annotate a text based on self-defined linguistic properties. These heuristics-based annotations can easily be converted into numeric values, which in turn vectorizes the text.

* In statistical language processing, it is important to reduce the effort of manual annotation and come up with ways to automatically vectorize a text.

* We will look at the most widely used method in machine-learning NLP for text vectorization: the bag-of-words method.


# Import necessary dependencies and settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd, numpy as np
import re, nltk
import matplotlib, matplotlib.pyplot as plt
%matplotlib inline

## Default style setting
matplotlib.rcParams['figure.dpi']=150
pd.options.display.max_colwidth=200


# Sample Corpus of Text Documents
* To build a quick intuition of how bag-of-words works, we start with a toy corpus consisting of eight documents. Each document is a single simple sentence.

* Each document in the corpus has a label (potentially referring to its topic).

In [2]:
corpus = [
    'The sky is blue and beautiful.', 'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    'I love green eggs, ham, sausages and bacon!',
    'The brown fox is quick and the blue dog is lazy!',
    'The sky is very blue and the sky is very beautiful today',
    'The dog is lazy but the brown fox is quick!'
]
labels = [
    'weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather',
    'animals'
]

corpus = np.array(corpus)  # NumPy array, so the documents can be processed with np.vectorize() later
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df

Unnamed: 0,Document,Category
0,The sky is blue and beautiful.,weather
1,Love this blue and beautiful sky!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beautiful today,weather
7,The dog is lazy but the brown fox is quick!,animals


# Simple Text Preprocessing
* A few steps for text preprocessing
  * Remove special characters
  * Normalize letter case
  * Remove redundant spaces
  * Tokenize each document into word-tokens
  * Remove stop words

All these preprocessing steps are wrapped in one function, normalize_document().

In [3]:
nltk.download('stopwords')
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
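  # remove special characters (anything that is not a letter or whitespace);
  # flags must be passed by keyword, since the 4th positional argument of re.sub() is `count`
  doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
  # lower case and strip leading/trailing whitespace
  doc = doc.lower()
  doc = doc.strip()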

  #tokenize document
  tokens = wpt.tokenize(doc)
  #filter stopwords out of document
  filtered_tokens = [token for token in tokens if token not in stop_words]
  #re-create document from filtered tokens
  doc = ' '.join(filtered_tokens)
  return doc

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
print(corpus)
print("_"*70)
print(norm_corpus)

['The sky is blue and beautiful.' 'Love this blue and beautiful sky!'
 'The quick brown fox jumps over the lazy dog.'
 "A king's breakfast has sausages, ham, bacon, eggs, toast and beans"
 'I love green eggs, ham, sausages and bacon!'
 'The brown fox is quick and the blue dog is lazy!'
 'The sky is very blue and the sky is very beautiful today'
 'The dog is lazy but the brown fox is quick!']
______________________________________________________________________
['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog'
 'kings breakfast sausages ham bacon eggs toast beans'
 'love green eggs ham sausages bacon' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']


# Bag of Words Model
* The bag-of-words model is the simplest way to vectorize texts into numeric representations, and it is easy to automate.

* In short, it is a method to represent a text using its word frequency list.

### CountVectorizer() from sklearn

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix


<8x20 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [7]:
# view non-zero feature positions in the sparse matrix
print(cv_matrix)

  (0, 17)	1
  (0, 3)	1
  (0, 2)	1
  (1, 17)	1
  (1, 3)	1
  (1, 2)	1
  (1, 14)	1
  (2, 15)	1
  (2, 5)	1
  (2, 8)	1
  (2, 11)	1
  (2, 13)	1
  (2, 6)	1
  (3, 12)	1
  (3, 4)	1
  (3, 16)	1
  (3, 10)	1
  (3, 0)	1
  (3, 7)	1
  (3, 18)	1
  (3, 1)	1
  (4, 14)	1
  (4, 16)	1
  (4, 10)	1
  (4, 0)	1
  (4, 7)	1
  (4, 9)	1
  (5, 3)	1
  (5, 15)	1
  (5, 5)	1
  (5, 8)	1
  (5, 13)	1
  (5, 6)	1
  (6, 17)	2
  (6, 3)	1
  (6, 2)	1
  (6, 19)	1
  (7, 15)	1
  (7, 5)	1
  (7, 8)	1
  (7, 13)	1
  (7, 6)	1


In [8]:
#view dense representation
#warning might give a memory error if data is too big

cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]])

In [9]:
#get all unique words in the corpus
vocab = cv.get_feature_names_out()
#show document feature vectors
pd.DataFrame(cv_matrix, columns = vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


* Issues with Bag-of-Words Text Representation

  * **Word order** is ignored.

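  * **Raw** absolute frequency counts of words do not necessarily represent the meaning of the text properly.

  * **Marginal** frequencies play important roles: the row totals (how long each document is) and the column totals (how frequent each word is across the corpus) both affect how a raw count should be interpreted.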
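
As a small illustration of the first issue, the sketch below builds bag-of-words counts with `collections.Counter` for two made-up sentences (they are not part of the corpus above) that use the same words in a different order. The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count whitespace-separated tokens; the order of the tokens is discarded."""
    return Counter(text.lower().split())

doc_a = bag_of_words("dog bites man")
doc_b = bag_of_words("man bites dog")

# The two sentences mean different things, yet their bags of words are identical.
same_representation = doc_a == doc_b
print(same_representation)  # True
```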

# Improving Bag-of-Words Text Representation
* In BOW text representation, the most crucial question is to identify words that are indeed **representative** of the semantics of texts.

* To improve the BOW representation:

  * We can extend from unigram-based BOW model to **n-gram** based BOW model to consider partially the word order in texts.

  * We can **filter** words based on the distributional criteria (e.g., term frequencies) or morphosyntactic patterns (e.g., morphological endings).

  * We can **weight** the BOW raw frequency counts.



In CountVectorizer(), we can utilize its parameters:

* ***max_df:*** When building the vocabulary, the vectorizer ignores terms that have a **document frequency** strictly higher than the given threshold (corpus-specific stop words). A float is interpreted as a proportion of documents; an integer as an absolute count.

* ***min_df:*** When building the vocabulary, the vectorizer ignores terms that have a **document frequency** strictly lower than the given threshold. A float is interpreted as a proportion of documents; an integer as an absolute count.

* ***max_features:*** Build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.

* ***ngram_range:*** The lower and upper boundary of the range of n-values for the word n-grams to be extracted. tuple (min_n, max_n), default=(1, 1).

* ***token_pattern:*** Regular expression denoting what constitutes a “token” in the vocabulary. The default regexp selects tokens of 2 or more alphanumeric characters (Note: **punctuation** is completely ignored and always treated as a token separator).
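
To make the filtering parameters concrete, here is a minimal sketch that fits `CountVectorizer` on the eight normalized documents (copied into a plain list named `docs`, which is an assumption of this example) with different settings. The thresholds 2 and 3 are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# the normalized corpus from above, repeated so the example is self-contained
docs = [
    'sky blue beautiful', 'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'kings breakfast sausages ham bacon eggs toast beans',
    'love green eggs ham sausages bacon',
    'brown fox quick blue dog lazy',
    'sky blue sky beautiful today', 'dog lazy brown fox quick',
]

# min_df=2 (integer): drop terms that occur in fewer than 2 documents
vocab_min = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

# max_df=3 (integer): drop terms that occur in more than 3 documents
vocab_max = sorted(CountVectorizer(max_df=3).fit(docs).vocabulary_)

# max_features=2: keep only the 2 terms with the highest total frequency
vocab_top = sorted(CountVectorizer(max_features=2).fit(docs).vocabulary_)

print(len(vocab_min), len(vocab_max), vocab_top)
```

In this corpus, `min_df=2` removes the words that occur in a single document (such as `jumps` or `today`), while `max_df=3` removes only `blue`, the one word that occurs in four documents.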

## N-gram Bag-of-Words Text Representation

In [10]:
# ngram_range=(min_n, max_n) sets which n-gram sizes are extracted. For example,
#  (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
# Here we use (2, 2) to build a bigram-only representation.

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

Unnamed: 0,bacon eggs,beautiful sky,beautiful today,blue beautiful,blue dog,blue sky,breakfast sausages,brown fox,dog lazy,eggs ham,...,lazy dog,love blue,love green,quick blue,quick brown,sausages bacon,sausages ham,sky beautiful,sky blue,toast beans
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,1,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0
5,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
6,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
7,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
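
What an n-gram feature counts can be sketched in plain Python. The helper `word_ngrams` below is a hypothetical illustration, not the way scikit-learn implements it; it is applied to the normalized document at index 6.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "sky blue sky beautiful today".split()

bigrams = word_ngrams(tokens, 2)
print(bigrams)
# ['sky blue', 'blue sky', 'sky beautiful', 'beautiful today']

# an ngram_range of (1, 2) corresponds to unigrams and bigrams together
uni_and_bigrams = word_ngrams(tokens, 1) + bigrams
```

These four bigrams are the columns marked with 1 in row 6 of the table above.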


In [11]:
tv = CountVectorizer(ngram_range=(1,3))
tv_matrix = tv.fit_transform(norm_corpus)

tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()
pd.DataFrame(tv_matrix, columns=vocab)

Unnamed: 0,bacon,bacon eggs,bacon eggs toast,beans,beautiful,beautiful sky,beautiful today,blue,blue beautiful,blue beautiful sky,...,sausages ham bacon,sky,sky beautiful,sky beautiful today,sky blue,sky blue beautiful,sky blue sky,toast,toast beans,today
0,0,0,0,0,1,0,0,1,1,0,...,0,1,0,0,1,1,0,0,0,0
1,0,0,0,0,1,1,0,1,1,1,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,1,0,1,1,0,0,...,0,2,1,1,1,0,1,0,0,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# TF-IDF Model
* The TF-IDF model is an extension of the bag-of-words model, whose main objective is to adjust the raw frequency counts by considering the **dispersion** of the words in the corpus.

* **Dispersion** refers to how evenly each word/term is distributed across different documents of the corpus.

* Interaction between Word Raw Frequency Counts and Dispersion:

  * Given a **high-frequency** word:

    * If the word is widely dispersed across different documents of the corpus (i.e., high **dispersion**)

      * it is more likely to be semantically general.

    * If the word is mostly centralized in a limited set of documents in the corpus (i.e., low dispersion)

      * it is more likely to be topic-specific.

* Dispersion rates of words can be used as weights for the importance of word frequency counts.

* **Document Frequency (DF)** is an intuitive metric for measuring word dispersion across the corpus. DF refers to the number of documents where the word occurs (at least once).

* The inverse of the DF is referred to as **Inverse Document Frequency (IDF)**. IDF is usually computed as follows:

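      IDF = 1 + log((1 + N) / (1 + df))

* Here N is the total number of documents, df is the document frequency of the word, and log is the natural logarithm. The plus-1's inside the logarithm (smoothing) avoid potential division-by-zero errors; the plus-1 in front of the logarithm keeps words that occur in every document from receiving a weight of zero.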
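
As a worked example, the sketch below evaluates the smoothed form `1 + ln((1 + N) / (1 + df))` for four words of the sample corpus. This is the variant scikit-learn uses when `smooth_idf=True`. The document frequencies in `doc_freq` are read off the document-term matrix shown earlier.

```python
import math

N = 8  # number of documents in the sample corpus

# number of documents in which each word occurs at least once
doc_freq = {"blue": 4, "sky": 3, "love": 2, "today": 1}

idf = {word: 1 + math.log((1 + N) / (1 + df)) for word, df in doc_freq.items()}

for word, value in idf.items():
    print(f"{word:6s} df={doc_freq[word]}  idf={value:.2f}")
```

The more documents a word appears in, the smaller its IDF: `blue` (4 documents) gets about 1.59, while `today` (1 document) gets about 2.50.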

* The raw absolute frequency counts of words in the BOW model are referred to as Term Frequency (TF).

The TF-IDF Weighting Scheme:

      TF-IDF (normalized) = (tf * idf) / sqrt(sum((tf * idf)^2))


The tf-idf is normalized using the L2 norm, i.e., the Euclidean norm: each tf-idf value is divided by the square root of the sum of the squared tf-idf values of all terms in the same document.
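
The weighting scheme can be traced by hand for the first normalized document, `'sky blue beautiful'`. The sketch below is a plain-Python calculation under the smoothed IDF assumption (`1 + ln((1 + N) / (1 + df))`); the dictionaries `tf` and `df` are filled in manually from the sample corpus.

```python
import math

N = 8                                          # documents in the corpus
tf = {"sky": 1, "blue": 1, "beautiful": 1}     # term counts in document 0
df = {"sky": 3, "blue": 4, "beautiful": 3}     # document frequencies in the corpus

# raw tf-idf: term frequency times smoothed inverse document frequency
raw = {w: tf[w] * (1 + math.log((1 + N) / (1 + df[w]))) for w in tf}

# L2 norm of the document vector (all other terms are 0 and add nothing)
l2 = math.sqrt(sum(v * v for v in raw.values()))

normalized = {w: v / l2 for w, v in raw.items()}
print({w: round(v, 2) for w, v in normalized.items()})
# {'sky': 0.6, 'blue': 0.53, 'beautiful': 0.6}
```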

**NOTE:**
* The sparsity-inducing behavior often attributed to the L1 norm (driving some weights to 0) concerns L1 *regularization* of model weights, not the normalization applied here.

* As a normalization, `norm='l1'` rescales each document vector so that its absolute values sum to 1, while `norm='l2'` rescales it so that its squared values sum to 1. Neither option changes which entries are zero.
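
To see what the two `norm` options do to a single document vector, here is a small NumPy sketch on a made-up vector `v` (not taken from the corpus). It only rescales the vector, so entries that are zero stay zero under either norm.

```python
import numpy as np

v = np.array([3.0, 0.0, 4.0])             # a made-up document vector

l1_scaled = v / np.abs(v).sum()           # divide by the sum of absolute values
l2_scaled = v / np.sqrt((v ** 2).sum())   # divide by the Euclidean length

print(l1_scaled.sum())                    # absolute values now sum to 1
print(l2_scaled)                          # [0.6 0.  0.8]
```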

# TfidfTransformer() from sklearn

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

In [13]:
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf= True)
tt_matrix = tt.fit_transform(cv_matrix)

tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
pd.DataFrame(np.round(tt_matrix,2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


# TfidfVectorizer() from sklearn

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
tv = TfidfVectorizer(min_df = 0.,
                     max_df=1.,
                     norm='l2',
                     use_idf = True,
                     smooth_idf = True)

tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix,2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0
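
The table above matches the `TfidfTransformer()` result because `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`. The following sketch checks this on the normalized corpus, copied into a list named `docs` (an assumption of this example) and using default parameters throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = [
    'sky blue beautiful', 'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'kings breakfast sausages ham bacon eggs toast beans',
    'love green eggs ham sausages bacon',
    'brown fox quick blue dog lazy',
    'sky blue sky beautiful today', 'dog lazy brown fox quick',
]

# two steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# one step: counts and weighting in a single estimator
one_step = TfidfVectorizer().fit_transform(docs).toarray()

same = np.allclose(two_step, one_step)
print(same)  # True
```

The one-step version is convenient when starting from raw text; the two-step version is useful when the count matrix is also needed on its own.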


# **Intuition of TF-IDF**

# Create Vocabulary Dictionary of the Corpus

In [16]:
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

In [17]:
#get unique words as feature names

unique_words = list({word for doc in norm_corpus for word in doc.split()})

#default dict
def_feature_dict = {w:0 for w in unique_words}

print('Feature Names:', unique_words)
print('Default Feature Dict:', def_feature_dict)

Feature Names: ['jumps', 'eggs', 'bacon', 'beautiful', 'sausages', 'blue', 'love', 'lazy', 'toast', 'beans', 'dog', 'ham', 'green', 'fox', 'breakfast', 'brown', 'sky', 'quick', 'kings', 'today']
Default Feature Dict: {'jumps': 0, 'eggs': 0, 'bacon': 0, 'beautiful': 0, 'sausages': 0, 'blue': 0, 'love': 0, 'lazy': 0, 'toast': 0, 'beans': 0, 'dog': 0, 'ham': 0, 'green': 0, 'fox': 0, 'breakfast': 0, 'brown': 0, 'sky': 0, 'quick': 0, 'kings': 0, 'today': 0}


# Create Document-Word Matrix (Bag-of-Word Frequencies)

In [18]:
from collections import Counter
# build bag of words features for each document - term frequencies
bow_features = []
for doc in norm_corpus:
  bow_feature_doc = Counter(doc.split())
  # counter with every vocabulary word set to 0
  all_features= Counter(def_feature_dict)

  # add the zero counts, so that words absent from this doc still get a 0 entry
  bow_feature_doc.update(all_features)

  #append cur doc dict
  bow_features.append(bow_feature_doc)

bow_features = pd.DataFrame(bow_features)
bow_features

Unnamed: 0,sky,blue,beautiful,jumps,eggs,bacon,sausages,love,lazy,toast,beans,dog,ham,green,fox,breakfast,brown,quick,kings,today
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0
3,0,0,0,0,1,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0
4,0,0,0,0,1,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0
5,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0
6,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0


# Compute Document Frequency of Words

In [19]:
import scipy.sparse as sp
feature_names = list(bow_features.columns)

# build the document frequency matrix
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
# `csc_matrix()` compresses `bow_features` into a column-oriented sparse matrix
# `csc_matrix.indices` stores the row indices of the non-zero values, column by column
# `csc_matrix.indptr` stores the cumulative number of non-zero values from column 0 to the right-most column,
# so `np.diff(indptr)` gives the number of documents in which each word occurs

df = 1 + df  # add 1 to smooth the idf later

# show smoothed document frequencies
pd.DataFrame([df], columns = feature_names)

Unnamed: 0,sky,blue,beautiful,jumps,eggs,bacon,sausages,love,lazy,toast,beans,dog,ham,green,fox,breakfast,brown,quick,kings,today
0,4,5,4,2,3,3,3,3,4,2,2,4,3,2,4,2,4,4,2,2
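
The `indptr` trick can be checked on a tiny example. The sketch below uses a made-up 3x3 count matrix (not the corpus) and compares `np.diff(indptr)` with a direct count of the non-zero entries per column, which gives the same document frequencies before the +1 smoothing.

```python
import numpy as np
import scipy.sparse as sp

# made-up count matrix: 3 documents (rows) x 3 words (columns)
counts = np.array([[1, 0, 2],
                   [0, 0, 1],
                   [3, 0, 0]])

csc = sp.csc_matrix(counts)
print(csc.indptr)                      # [0 2 2 4]

# differences of consecutive pointers = stored (non-zero) values per column
df_sparse = np.diff(csc.indptr)

# the same quantity computed directly on the dense array
df_dense = (counts > 0).sum(axis=0)

print(df_sparse, df_dense)             # [2 0 2] [2 0 2]
```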


# Create Inverse Document Frequency of Words

In [20]:
# compute inverse document frequencies for each term
total_docs = 1 + len(norm_corpus)
idf = 1.0 + np.log(float(total_docs) / df)

# show smoothened idfs
pd.DataFrame([np.round(idf, 2)], columns=feature_names)

Unnamed: 0,sky,blue,beautiful,jumps,eggs,bacon,sausages,love,lazy,toast,beans,dog,ham,green,fox,breakfast,brown,quick,kings,today
0,1.81,1.59,1.81,2.5,2.1,2.1,2.1,2.1,1.81,2.5,2.5,1.81,2.1,2.5,1.81,2.5,1.81,1.81,2.5,2.5


# Compute Raw TF-IDF for Each Document

In [21]:
# compute tfidf feature matrix
tf = np.array(bow_features, dtype='float64')
tfidf = tf * idf  ## `tf.shape` = (8,20), `idf.shape`=(20,)
# view raw tfidf feature matrix
pd.DataFrame(np.round(tfidf, 2), columns=feature_names)

Unnamed: 0,sky,blue,beautiful,jumps,eggs,bacon,sausages,love,lazy,toast,beans,dog,ham,green,fox,breakfast,brown,quick,kings,today
0,1.81,1.59,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.81,1.59,1.81,0.0,0.0,0.0,0.0,2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,1.81,0.0,0.0,1.81,0.0,0.0,1.81,0.0,1.81,1.81,0.0,0.0
3,0.0,0.0,0.0,0.0,2.1,2.1,2.1,0.0,0.0,2.5,2.5,0.0,2.1,0.0,0.0,2.5,0.0,0.0,2.5,0.0
4,0.0,0.0,0.0,0.0,2.1,2.1,2.1,2.1,0.0,0.0,0.0,0.0,2.1,2.5,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,1.59,0.0,0.0,0.0,0.0,0.0,0.0,1.81,0.0,0.0,1.81,0.0,0.0,1.81,0.0,1.81,1.81,0.0,0.0
6,3.62,1.59,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.81,0.0,0.0,1.81,0.0,0.0,1.81,0.0,1.81,1.81,0.0,0.0


# Get L2 Norms of TF-IDF

In [22]:
from numpy.linalg import norm
# compute L2 norms
norms = norm(tfidf, axis=1)  # L2 norm of each row, i.e., one norm per document

# print norms for each document
print(np.round(norms, 3))

[3.013 3.672 4.761 6.534 5.319 4.35  5.019 4.049]


# Compute Normalized TF-IDF for Each Document

In [23]:
# compute normalized tfidf
norm_tfidf = tfidf / norms[:, None]

# show final tfidf feature matrix
pd.DataFrame(np.round(norm_tfidf, 2), columns=feature_names)

Unnamed: 0,sky,blue,beautiful,jumps,eggs,bacon,sausages,love,lazy,toast,beans,dog,ham,green,fox,breakfast,brown,quick,kings,today
0,0.6,0.53,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.49,0.43,0.49,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.53,0.0,0.0,0.0,0.0,0.38,0.0,0.0,0.38,0.0,0.0,0.38,0.0,0.38,0.38,0.0,0.0
3,0.0,0.0,0.0,0.0,0.32,0.32,0.32,0.0,0.0,0.38,0.38,0.0,0.32,0.0,0.0,0.38,0.0,0.0,0.38,0.0
4,0.0,0.0,0.0,0.0,0.39,0.39,0.39,0.39,0.0,0.0,0.0,0.0,0.39,0.47,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.37,0.0,0.0,0.0,0.0,0.0,0.0,0.42,0.0,0.0,0.42,0.0,0.0,0.42,0.0,0.42,0.42,0.0,0.0
6,0.72,0.32,0.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0,0.45,0.0,0.0,0.45,0.0,0.45,0.45,0.0,0.0
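
The division `tfidf / norms[:, None]` relies on NumPy broadcasting. A minimal sketch on a made-up 2x2 matrix `m` shows the shapes involved: `axis=1` yields one norm per row, and `[:, None]` turns that vector into a column so that each row is divided by its own norm.

```python
import numpy as np

m = np.array([[3.0, 4.0],
              [6.0, 8.0]])              # made-up matrix: 2 documents x 2 terms

row_norms = np.linalg.norm(m, axis=1)   # one L2 norm per row: [ 5. 10.]
column = row_norms[:, None]             # shape (2, 1) instead of (2,)

unit_rows = m / column                  # each row divided by its own norm
print(unit_rows)
# [[0.6 0.8]
#  [0.6 0.8]]
```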


In [24]:
new_doc = 'the sky is green today'

pd.DataFrame(np.round(tv.transform([new_doc]).toarray(), 2),
             columns=tv.get_feature_names_out())

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.63
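
The weights for the new document can be reproduced by hand. Words that are not in the fitted vocabulary (`the` and `is` here) are ignored, and the remaining words use the document frequencies learned from the eight training documents. The sketch below is a plain-Python check under the smoothed IDF assumption; `doc_freq` lists only the vocabulary entries relevant to this sentence.

```python
import math

N = 8                                           # documents used to fit the vectorizer
doc_freq = {"green": 1, "sky": 3, "today": 1}   # learned document frequencies

tokens = "the sky is green today".split()
known = [t for t in tokens if t in doc_freq]    # "the" and "is" are dropped

# raw tf-idf for the known words, then L2 normalization
raw = {t: known.count(t) * (1 + math.log((1 + N) / (1 + doc_freq[t])))
       for t in known}
l2 = math.sqrt(sum(v * v for v in raw.values()))

weights = {t: round(v / l2, 2) for t, v in raw.items()}
print(weights)  # {'sky': 0.46, 'green': 0.63, 'today': 0.63}
```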
