# Keyword extraction with `TfidfVectorizer`

Scikit-learn's `CountVectorizer` class creates matrices of word counts and is frequently used in text-classification tasks. The related [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class creates matrices of [Term Frequency-Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (TF-IDF) values that reflect not just the presence of individual words, but each word's importance: a word scores higher the more often it appears in a document and the fewer documents it appears in overall. One use for `TfidfVectorizer` is extracting keywords from documents. Let's use it to extract keywords from a book chapter on machine learning. Begin by loading the chapter from a text file and showing the first few paragraphs.

In [1]:
import pandas as pd

df = pd.read_csv('Data/chapter-1.txt', sep='\r\n', engine='python', header=None)
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,0
0,"Software developers are accustomed to solving problems algorithmically. Given a recipe, or algorithm, it's not difficult to write an app that hashes a password or computes a monthly mortgage payment. You code up the algorithm, feed it input, and receive output in return. It's another proposition altogether to write code that determines whether a photo contains a cat or a dog. You can try to do it algorithmically, but the minute you get it working, I'll send you a cat or dog picture that breaks the algorithm."
1,"Machine learning takes a different approach to turning input into output. Rather than rely on you to implement an algorithm, it examines a dataset consisting of inputs and outputs and learns how to generate output of its own. Under the hood, special algorithms called learning algorithms build mathematical models of the data and codify the relationship between data going in and data coming out. Once trained in this manner, a model can accept new inputs and generate outputs consistent with the ones in the training data."
2,"To use machine learning to distinguish between cats and dogs, you don't code a cat-vs-dog algorithm. Instead, you train a machine-learning model with cat and dog photos. Success depends on the learning algorithm used and the quality and volume of the training data. Part of becoming a machine-learning engineer is familiarizing yourself with the various learning algorithms and developing an intuition for when to use one versus another. That intuition begins with an examination of machine learning itself."
3,What is Machine Learning?
4,"At an existential level, machine learning (ML) is a means for finding patterns in numbers and exploiting those patterns to make predictions. Train a model with thousands (or millions) of xs and ys, and let it learn from the data so that given a new x, it can predict what y will be. Learning is the process by which ML finds patterns that can be used to predict future outputs, and it's where the 'learning' in 'machine learning' comes from."
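
Before vectorizing the chapter, it may help to see how a TF-IDF weight is put together. The sketch below computes weights by hand for a made-up four-document corpus, using the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default. The corpus, the whitespace tokenizer, and the names `doc_freq`, `idf`, and `tfidf` are illustrative assumptions, not part of the chapter data or of scikit-learn.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each string plays the role of one paragraph
docs = ['cats chase mice', 'dogs chase cats', 'dogs bark', 'cats']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]

for doc, doc_weights in zip(docs, weights):
    print(doc, {term: round(weight, 3) for term, weight in doc_weights.items()})
```

In the first document, the rare term `mice` outweighs the common term `cats`, and the one-word document receives a weight of exactly 1.0 because normalization scales every row to unit length. The same effect produces the 1.0 values for short paragraphs in the outputs that follow.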


Vectorize the paragraphs as two-word phrases (bigrams, selected with `ngram_range=(2, 2)`) and show the first 10 rows of the resulting word matrix.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=0.025, max_df=0.5, stop_words='english')
word_matrix = vectorizer.fit_transform(df[0])

feature_names = vectorizer.get_feature_names_out()
wm_df = pd.DataFrame(data=word_matrix.toarray(), columns=feature_names)
wm_df.head(10)

Unnamed: 0,1s 0s,add column,annual income,build mathematical,cat dog,classification models,column contains,credit card,data data,data points,...,sepal width,setosa versicolor,spending score,spending scores,supervised learning,train machine,training data,unsupervised learning,use following,versicolor virginica
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.526645,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.411714,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.329181,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.297392,0.257343,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.727929,0.0,0.0,0.0,0.0,0.0,0.40287,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.67037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Convert the sparse word matrix into a coordinate matrix that includes only non-zero values (weights) and the rows and columns in which they appear.

In [3]:
coo_matrix = word_matrix.tocoo()
print(coo_matrix)

  (0, 4)	1.0
  (1, 49)	0.41171364214971173
  (1, 28)	0.5266452223745365
  (1, 3)	0.5266452223745365
  (1, 21)	0.43970283823932543
  (1, 25)	0.28712873491214824
  (2, 20)	0.27483761002183926
  (2, 22)	0.26560327297719427
  (2, 48)	0.29739224366595657
  (2, 49)	0.25734287700964276
  (2, 21)	0.27483761002183926
  (2, 25)	0.7178827918222094
  (2, 4)	0.3291812143547189
  (3, 25)	1.0
  (4, 26)	0.6079427215514522
  (4, 25)	0.7939808859869445
  (5, 6)	0.4028697432073993
  (5, 18)	0.38145874437527655
  (5, 10)	0.4028697432073993
  (5, 0)	0.7279293690706851
  (6, 30)	0.5768531568260111
  (6, 37)	0.5211465511802166
  (6, 25)	0.6290045370685583
  (7, 30)	0.7420272038414925
  (7, 0)	0.6703697701710424
  :	:
  (91, 40)	1.0
  (93, 49)	0.3686419838367719
  (93, 28)	0.47154993101664205
  (93, 3)	0.47154993101664205
  (93, 21)	0.39370307415818434
  (93, 25)	0.514181195949137
  (94, 7)	0.25937135995965965
  (94, 19)	0.46864782871828475
  (94, 47)	0.39376023290373574
  (94, 50)	0.19150502328554675
  (94, 
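
To make the coordinate format concrete, here is a minimal sketch that converts a small hand-built matrix and reads back its `row`, `col`, and `data` arrays. The 3x3 matrix is made up for illustration. Each printed line corresponds to one `(row, column)  weight` entry like those in the output above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small dense matrix (hypothetical values) with three non-zero entries
dense = np.array([
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
    [0.0, 0.0, 0.0],
])

# Store it sparsely, then convert to coordinate (COO) format
coo = csr_matrix(dense).tocoo()

# row, col, and data are parallel arrays with one element per non-zero value
for row, col, value in zip(coo.row, coo.col, coo.data):
    print(f'row {row}, column {col}: {value}')
```

Only the three non-zero values are stored, and the all-zero third row does not appear at all.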

Create tuples from the column numbers and weights in the coordinate matrix. Then sort the tuples in descending order based on the weights.

In [4]:
tuples = list(zip(coo_matrix.col, coo_matrix.data))
sorted_tuples = sorted(tuples, key=lambda x: x[1], reverse=True)

# Unpack each (column index, weight) pair instead of shadowing the built-in `tuple`
for col, weight in sorted_tuples:
    print(f'({col}, {weight}) => {feature_names[col]}')

(4, 1.0) => cat dog
(25, 1.0) => machine learning
(25, 1.0) => machine learning
(25, 1.0) => machine learning
(25, 1.0) => machine learning
(50, 1.0) => unsupervised learning
(12, 1.0) => following code
(29, 1.0) => means clustering
(29, 1.0) => means clustering
(9, 1.0) => data points
(47, 1.0) => supervised learning
(32, 1.0) => nearest neighbors
(32, 1.0) => nearest neighbors
(32, 1.0) => nearest neighbors
(32, 1.0) => nearest neighbors
(31, 1.0) => model accuracy
(27, 1.0) => making predictions
(36, 1.0) => predict class
(49, 1.0) => training data
(40, 1.0) => scikit learn
(32, 0.8805448238444277) => nearest neighbors
(37, 0.8561751021506758) => real world
(25, 0.803856205496376) => machine learning
(25, 0.7939808859869445) => machine learning
(49, 0.7731187523324384) => training data
(39, 0.7676246648600459) => right number
(2, 0.7500525583741241) => annual income
(32, 0.7441198535306921) => nearest neighbors
(30, 0.7420272038414925) => millions rows
(6, 0.7420272038414925) => col

Show the top keywords by weight and use `set` to eliminate duplicates.

In [5]:
keywords = []
num_keywords = 5

# Look up the bigram for each of the top-weighted column indices
for col, _ in sorted_tuples[:num_keywords]:
    keywords.append(feature_names[col])

print(set(keywords))

{'cat dog', 'machine learning'}
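
Because `set` is applied after slicing, the five top-weighted entries collapse to just two distinct keywords, and a set does not preserve ranking order. One possible alternative, sketched below, walks the sorted pairs and stops once it has collected the requested number of distinct keywords. The helper name `top_unique_keywords` is hypothetical, and the sample pairs are copied from the first few lines of the sorted output above.

```python
def top_unique_keywords(sorted_pairs, names, num_keywords):
    """Return the first `num_keywords` distinct keywords in ranked order."""
    unique = {}  # dicts preserve insertion order, so ranking is retained
    for col, _ in sorted_pairs:
        unique.setdefault(names[col], None)
        if len(unique) == num_keywords:
            break
    return list(unique)

# (column index, weight) pairs and names taken from the sorted output above
sample_pairs = [(4, 1.0), (25, 1.0), (25, 1.0), (25, 1.0), (25, 1.0),
                (50, 1.0), (12, 1.0), (29, 1.0), (29, 1.0)]
sample_names = {4: 'cat dog', 25: 'machine learning', 50: 'unsupervised learning',
                12: 'following code', 29: 'means clustering'}

top_keywords = top_unique_keywords(sample_pairs, sample_names, 5)
print(top_keywords)
```

Note that many entries tie at a weight of 1.0, so the order among them reflects only their position in the coordinate matrix, not any real difference in importance.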


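Keyword extraction sometimes works better when you sum all of the weights for a given phrase across paragraphs and select the phrases with the highest sums, rather than the phrases with the highest individual weights. The highest individual weights tend to come from very short paragraphs such as headings, where a single phrase accounts for the entire normalized row and therefore scores 1.0. Sort keywords based on the summed criterion: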

In [6]:
# Sum each column of the word matrix to get one total weight per bigram
summed_weights = wm_df.sum()

sorted_summed_weights = summed_weights.sort_values(ascending=False)
print(sorted_summed_weights)

machine learning             13.819216
nearest neighbors             6.901019
means clustering              6.447980
unsupervised learning         5.286333
supervised learning           4.542468
training data                 4.519290
data points                   4.133683
learning models               3.733430
real world                    3.273201
learning model                3.184781
1s 0s                         3.181189
learning algorithm            3.148522
following code                2.890544
use following                 2.778309
learning algorithms           2.697675
make predictions              2.642662
spending scores               2.513924
predict class                 2.400059
annual income                 2.342413
train machine                 2.226982
right number                  2.181838
scikit learn                  2.129959
labeled data                  2.092886
model accuracy                2.091969
following statements          2.025025
making predictions       

Show the top keywords by summed weights:

In [7]:
# The index labels of the sorted Series are the keywords themselves
keywords = list(sorted_summed_weights.index[:num_keywords])

print(keywords)

['machine learning', 'nearest neighbors', 'means clustering', 'unsupervised learning', 'supervised learning']


If you read **chapter-1.txt**, you'll see that these keywords highlight some of the most important concepts introduced in the chapter.