# 01-TF-IDF

We will compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [52]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Import the data from the file *headlines.csv*:

In [53]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')

As usual, check the basic information of the dataset.

In [54]:
# TODO: Have a look at the data
print(df.head())
df.info()

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


We will now preprocess this text data: tokenization, removal of punctuation and stop words, and stemming.

Hint: to do so, use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.

In [55]:
# TODO: Perform preprocessing
# import needed modules
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Tokenize
df['headline_text'] = df['headline_text'].astype(str)
df['headline_text'] = df['headline_text'].apply(word_tokenize)

# Remove punctuation
df['headline_text'] = df['headline_text'].apply(lambda x: [word for word in x if word.isalnum()])

# Remove stop words
stop_words = set(stopwords.words('english'))
df['headline_text'] = df['headline_text'].apply(lambda x: [word for word in x if word not in stop_words])

# Stem
stemmer = PorterStemmer()
df['headline_text'] = df['headline_text'].apply(lambda x: [stemmer.stem(word) for word in x])

print(df['headline_text'].head())

0      [algorithm, make, decis, behalf, feder, minist]
1    [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                              [rural, mural, thallan]
3               [australia, church, risk, becom, abus]
4    [australian, compani, usgfx, embroil, shanghai...
Name: headline_text, dtype: object


Now compute the Bag of Words for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass the analyzer with an identity function.

In [56]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer

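# Re-join the tokens into strings so that the default analyzer can be used.
# Note: this does not bypass the analyzer as the warning above asks; the default
# analyzer re-tokenizes the text and drops single-character tokens.
df['headline_text'] = df['headline_text'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer()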
X_bow = vectorizer.fit_transform(df['headline_text'])

print(X_bow.shape)

(1999, 4251)


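Check the shape of the BOW: the expected value is `(1999, 4165)`. The cell above gives `(1999, 4251)`, which suggests that its preprocessing or vectorizer settings differ from the reference solution.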

Now compute the Term Frequency and then the Inverse Document Frequency, and check that the values are not all zeros.
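
Printing a large, mostly empty matrix is not a reliable check, because NumPy truncates the display and only shows the corners. A minimal sketch of a more direct check, on a hypothetical toy matrix (`toy_tf` is illustrative, not the notebook's data):

```python
import numpy as np

# Hypothetical matrix with a single non-zero value hidden in the middle
toy_tf = np.zeros((1000, 1000))
toy_tf[500, 500] = 0.25

print(toy_tf)  # the truncated display shows only zeros

# Counting non-zero entries and looking at the maximum reveals the hidden value
nonzero_count = np.count_nonzero(toy_tf)
max_value = toy_tf.max()
print(nonzero_count, max_value)  # 1 0.25
```

The same two calls can be applied to a dense TF or TF-IDF array of any size.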

In [65]:
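# TODO: Compute the TF using the BOW
bow_matrix = X_bow.toarray()

# Compute the term frequencies (TF): counts divided by the document length.
# Documents left empty by the preprocessing get a divisor of 1 to avoid 0/0.
doc_lengths = bow_matrix.sum(axis=1, keepdims=True)
tf_matrix = bow_matrix / np.maximum(doc_lengths, 1)

print(tf_matrix)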

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [66]:
# TODO: Compute the IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(df['headline_text'])

idf_values = tfidf_vectorizer.idf_
words = tfidf_vectorizer.get_feature_names_out()

print(idf_values)

[6.99146455 7.2146081  7.90775528 ... 7.90775528 7.90775528 7.90775528]
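
The values printed above are consistent with scikit-learn's default smoothed IDF (`smooth_idf=True`). For a corpus of $n$ documents and a term $t$ that appears in $\mathrm{df}(t)$ of them, that formula is:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
```

With $n = 1999$, a term found in a single headline gives $\ln(2000/2) + 1 = \ln 1000 + 1 \approx 7.9078$, a term found in three headlines gives $\ln 500 + 1 \approx 7.2146$, and a term found in four headlines gives $\ln 400 + 1 \approx 6.9915$, which matches the printed values.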


Finally, compute the TF-IDF.
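
One way to relate the pieces is to multiply TF by IDF and then L2-normalize each row, which should reproduce scikit-learn's default output. The sketch below checks this on a hypothetical toy count matrix, assuming scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`); all `toy_*` and `manual_*` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical toy count matrix: 2 documents, 4 terms
toy_bow = np.array([[1, 0, 1, 1],
                    [0, 1, 0, 1]])

# TF: counts divided by the document length
toy_tf = toy_bow / toy_bow.sum(axis=1, keepdims=True)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
n_docs = toy_bow.shape[0]
doc_freq = (toy_bow > 0).sum(axis=0)
toy_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then L2 normalization of each row
manual_tfidf = toy_tf * toy_idf
manual_l2 = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

# Reference: scikit-learn's transformer with default settings
sklearn_tfidf = TfidfTransformer().fit_transform(toy_bow).toarray()
print(np.allclose(manual_l2, sklearn_tfidf))
```

The row normalization is why dividing by the document length does not change the normalized result: any per-row scaling cancels out.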

In [67]:
# TODO: compute the TF-IDF
X_tfidf = tfidf_vectorizer.fit_transform(df['headline_text'])

print("TF-IDF array:\n", X_tfidf.toarray())
print("Feature names:\n", tfidf_vectorizer.get_feature_names_out())

TF-IDF array:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Feature names:
 ['10' '100' '1000km' ... 'zone' 'zonta' 'zoo']


What are the 10 words with the highest and lowest TF-IDF on average?

In [60]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
tfidf_scores = np.mean(X_tfidf.toarray(), axis=0)
words = tfidf_vectorizer.get_feature_names_out()

highest_indices = np.argsort(tfidf_scores)[-10:]
lowest_indices = np.argsort(tfidf_scores)[:10]

highest_words = [(words[i], tfidf_scores[i]) for i in highest_indices]
lowest_words = [(words[i], tfidf_scores[i]) for i in lowest_indices]

print("Highest TF-IDF words:", highest_words)
print("Lowest TF-IDF words:", lowest_words)

Highest TF-IDF words: [('sydney', 0.005659788840016151), ('charg', 0.006028832916829904), ('wa', 0.006274671593818188), ('man', 0.006548453421337382), ('trump', 0.006840891998202155), ('say', 0.007555848605072935), ('polic', 0.007736059204748111), ('new', 0.008703107457097207), ('australian', 0.009729510942149733), ('australia', 0.009983014998891405)]
Lowest TF-IDF words: [('geel', 0.0001527054029533165), ('gcfc', 0.0001527054029533165), ('adel', 0.0001527054029533165), ('melb', 0.0001527054029533165), ('coll', 0.0001527054029533165), ('syd', 0.0001527054029533165), ('gw', 0.0001527054029533165), ('haw', 0.0001527054029533165), ('nmfc', 0.0001527054029533165), ('fabio', 0.00016136766779501044)]


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the data you used to compute the BOW).

In [61]:
# TODO: Compute the TF-IDF using scikit-learn
# Import the module

# Instantiate the TF-IDF vectorizer

# Compute the TF-IDF


Compare the 10 words with the highest and lowest average TF-IDF to the ones you computed yourself.
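
To make the comparison easier, the ranking can be wrapped in a small function and applied to both TF-IDF matrices. This is a minimal sketch: `rank_words_by_mean` is a hypothetical helper, and the toy scores and names are made up for the illustration.

```python
import numpy as np

def rank_words_by_mean(matrix, names, k=10):
    """Hypothetical helper: return (highest, lowest) lists of (word, mean score)."""
    # Works for dense arrays and for scipy sparse matrices
    mean_scores = np.asarray(matrix.mean(axis=0)).ravel()
    order = np.argsort(mean_scores)  # ascending order of mean score
    lowest = [(names[i], float(mean_scores[i])) for i in order[:k]]
    highest = [(names[i], float(mean_scores[i])) for i in order[::-1][:k]]
    return highest, lowest

# Made-up scores: 2 documents, 3 terms
toy_scores = np.array([[0.2, 0.0, 0.8],
                       [0.4, 0.1, 0.5]])
toy_names = ["church", "mural", "rural"]

highest, lowest = rank_words_by_mean(toy_scores, toy_names, k=2)
print(highest)  # 'rural' first, then 'church'
print(lowest)   # 'mural' first, then 'church'
```

Comparing the sets of words returned for the two matrices then shows directly whether the rankings agree.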

In [62]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


Do you have the same words? How do you explain it?