# 01-TF-IDF

We will compute TF-IDF on a corpus of newspaper headlines.

Begin by importing the needed libraries:

In [78]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Import the data from the file *headlines.csv*

In [79]:
# TODO: Load the dataset
df = pd.read_csv("headlines.csv")
df.head()

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...


As usual, check the dataset's basic information.

In [80]:
# TODO: Have a look at the data
print(df.head())
print(df.info())

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB
None


We will now preprocess this text data: tokenization, punctuation and stop word removal, and stemming.

Hint: use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.

In [81]:
# TODO: Perform preprocessing
# import needed modules
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Build the stop word set and the stemmer once, not on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase and tokenize the text
    tokens = word_tokenize(text.lower())

    # Remove punctuation tokens
    tokens = [token for token in tokens if token not in string.punctuation]

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # Stem each remaining token
    tokens = [stemmer.stem(token) for token in tokens]

    # Join the stems back into a single space-separated string
    return " ".join(tokens)

df['stemmed'] = df['headline_text'].apply(preprocess)
# Show the output
print(df['stemmed'])

0                algorithm make decis behalf feder minist
1       andrew forrest fmg appeal pilbara nativ titl rule
2                                     rural mural thallan
3                        australia church risk becom abus
4       australian compani usgfx embroil shanghai staf...
                              ...                        
1994    constitut avenu win top prize act architectu a...
1995                              dark mofo number crunch
1996    david petraeu say australia must firm south ch...
1997    driverless car australia face challeng roo pro...
1998                     drug compani criticis price hike
Name: stemmed, Length: 1999, dtype: object


Now compute the Bag of Words (BOW) for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass the default analyzer with an identity function. An identity analyzer expects each document to be a list of tokens.

In [82]:
# TODO: Compute the BOW of the preprocessed data
# Import the library
from sklearn.feature_extraction.text import CountVectorizer


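# Compute the BOW representation of the preprocessed data
# The identity analyzer returns each document unchanged. Since df['stemmed']
# holds joined strings, CountVectorizer iterates over characters, so each
# distinct character becomes a feature (stop_words is unused with a callable analyzer)
vectorizer = CountVectorizer(analyzer=lambda x: x)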
bow = vectorizer.fit_transform(df['stemmed']).toarray()
print(bow.shape)

(1999, 38)


Check the shape of the BOW: the expected value is `(1999, 4165)`.
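
The recorded shape `(1999, 38)` is far from the expected one, and a likely cause is the input type: an identity analyzer returns each document unchanged and `CountVectorizer` then iterates over it, so a joined string is counted character by character while a list of tokens is counted token by token. The sketch below illustrates this on a two-document toy corpus (the `docs` data and variable names are made up for illustration); splitting on whitespace with `str.split` is one possible way to get token-level counts from joined strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), already preprocessed and joined into strings
docs = ["rural mural", "rural church"]

# Identity analyzer on strings: iterating a string yields characters
char_vec = CountVectorizer(analyzer=lambda x: x)
char_bow = char_vec.fit_transform(docs)
print(sorted(char_vec.vocabulary_))

# Whitespace split: each document becomes a list of tokens
tok_vec = CountVectorizer(analyzer=str.split)
tok_bow = tok_vec.fit_transform(docs)
print(sorted(tok_vec.vocabulary_))

# Identity analyzer on token lists gives the same token-level shape
list_vec = CountVectorizer(analyzer=lambda x: x)
list_bow = list_vec.fit_transform([d.split() for d in docs])
print(list_bow.shape == tok_bow.shape)
```

On the toy corpus, the character-level vocabulary contains letters and the space, while the token-level vocabulary contains the three stems.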

Now compute the Term Frequency (TF) and then the Inverse Document Frequency (IDF), and check that the values are not all zeros.

In [83]:
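# TODO: Compute the TF using the BOW

# Share of each feature in the total count of the whole corpus:
# one value per feature, not one per (document, feature) pair
tf = np.sum(bow, axis=0) / np.sum(bow)
print(tf)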

[1.36008246e-01 7.54204691e-05 1.19415743e-03 1.03074641e-03
 8.67335395e-04 3.14251955e-04 3.01681876e-04 2.63971642e-04
 2.51401564e-04 4.52522815e-04 2.38831486e-04 3.01681876e-04
 8.51497096e-02 1.89933881e-02 3.86655605e-02 3.11486537e-02
 8.10644342e-02 1.51972245e-02 1.87545567e-02 2.38957186e-02
 6.70487970e-02 3.10480931e-03 1.22809664e-02 4.73640546e-02
 2.75913216e-02 5.86394147e-02 5.60876889e-02 2.46876336e-02
 1.84780149e-03 7.34846771e-02 4.73389144e-02 5.85639943e-02
 3.12366443e-02 9.45269880e-03 1.42418986e-02 2.42602509e-03
 9.12587676e-03 1.30728813e-03]
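
The vector above has one value per feature for the whole corpus. If a per-document term frequency is wanted instead, one common definition divides each row of the count matrix by that document's total count. A minimal sketch on a made-up count matrix (`toy_bow` is hypothetical, not the headlines data):

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 4 terms (columns)
toy_bow = np.array([
    [2, 1, 0, 1],
    [0, 3, 1, 0],
    [1, 0, 0, 1],
])

# Per-document TF: each count divided by the total count of its document
doc_lengths = toy_bow.sum(axis=1, keepdims=True)
tf_per_doc = toy_bow / doc_lengths

# Corpus-level relative frequency: one value per term
tf_corpus = toy_bow.sum(axis=0) / toy_bow.sum()

print(tf_per_doc)
print(tf_corpus)
```

Each row of `tf_per_doc` sums to 1, whereas `tf_corpus` sums to 1 over the whole vocabulary.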


In [84]:
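# TODO: Compute the IDF
from sklearn.feature_extraction.text import TfidfTransformer

# fit_transform returns the L2-normalized TF-IDF matrix, not the IDF vector;
# the IDF weights alone are stored in the fitted transformer's idf_ attribute
idf = TfidfTransformer(use_idf=True).fit_transform(bow)

print(idf)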

  (0, 31)	0.1729787380501601
  (0, 30)	0.18357889090424717
  (0, 29)	0.16659403809380424
  (0, 26)	0.08749484572182369
  (0, 25)	0.08609938784881604
  (0, 24)	0.32930566581031706
  (0, 23)	0.1837648442057908
  (0, 22)	0.15309537938449247
  (0, 20)	0.34095844679515935
  (0, 19)	0.2348496708070992
  (0, 18)	0.12972178878245524
  (0, 17)	0.2902351341817359
  (0, 16)	0.40884261448336146
  (0, 15)	0.2147088649368237
  (0, 14)	0.10011226400742812
  (0, 13)	0.1298723097496903
  (0, 12)	0.24444796124505672
  (0, 0)	0.39438269757200334
  (1, 34)	0.11788948961467856
  (1, 33)	0.13754041088197883
  (1, 32)	0.08465191760659271
  (1, 31)	0.2812233113573361
  (1, 30)	0.07461417538557386
  (1, 29)	0.3385540874360546
  (1, 27)	0.28293790711235167
  :	:
  (1997, 18)	0.1053944202990478
  (1997, 17)	0.1179029521740513
  (1997, 16)	0.3321703373924552
  (1997, 15)	0.08722172491404796
  (1997, 14)	0.24401314834432428
  (1997, 13)	0.10551671332501891
  (1997, 12)	0.3972108527105254
  (1997, 0)	0.384506592239

Finally, compute the TF-IDF.

In [85]:
# TODO: compute the TF-IDF

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(bow)
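
For reference, the weighting applied by `TfidfTransformer` with default settings can be reproduced by hand: multiply the raw counts by the smoothed IDF, then scale each row to unit L2 norm. A minimal sketch on a made-up count matrix (all names are hypothetical), compared against scikit-learn's result:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 4 terms (columns)
toy_bow = np.array([
    [2, 1, 0, 1],
    [0, 3, 1, 0],
    [1, 0, 0, 1],
])

# Smoothed IDF from document frequencies
n_docs = toy_bow.shape[0]
doc_freq = (toy_bow > 0).sum(axis=0)
idf_weights = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by IDF, then L2-normalize each document (row)
weighted = toy_bow * idf_weights
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Same computation through scikit-learn with default parameters
tfidf_sklearn = TfidfTransformer().fit_transform(toy_bow).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```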


What are the 10 words with the highest and lowest TF-IDF on average?

In [86]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average

avg_tfidf = np.asarray(tfidf.mean(axis=0)).flatten()
ind_sorted = np.argsort(avg_tfidf)

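# Map column indices back to feature names (get_feature_names_out requires scikit-learn >= 1.0)
feature_names = vectorizer.get_feature_names_out()
lowest_tfidf = [feature_names[ind] for ind in ind_sorted[:10]]
highest_tfidf = [feature_names[ind] for ind in ind_sorted[-10:][::-1]]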

print("10 words with the lowest average TF-IDF:")
print(lowest_tfidf)
print("\n10 words with the highest average TF-IDF:")
print(highest_tfidf)

10 words with the lowest average TF-IDF:
['.', '8', '5', '6', '9', '4', '3', '7', '2', '1']

10 words with the highest average TF-IDF:
[' ', 'a', 'e', 'r', 'i', 't', 'n', 'o', 's', 'l']




Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [87]:
# TODO: Compute the TF-IDF using scikit learn
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer()

tfidf = tfidf_vectorizer.fit_transform(df['stemmed'])

Compare the 10 words with the highest and lowest average TF-IDF to the ones you computed yourself.

In [88]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
feature_names = tfidf_vectorizer.get_feature_names_out()

avg_tfidf = tfidf.mean(axis=0).A1

sorted_tfidf = np.argsort(avg_tfidf)

print("10 words with the highest average TF-IDF:")
print([feature_names[i] for i in sorted_tfidf[-10:][::-1]])
print("\n10 words with the lowest average TF-IDF:")
print([feature_names[i] for i in sorted_tfidf[:10]])

10 words with the highest average TF-IDF:
['australia', 'australian', 'new', 'polic', 'say', 'trump', 'man', 'wa', 'charg', 'sydney']

10 words with the lowest average TF-IDF:
['adel', 'melb', 'haw', 'coll', 'gw', 'syd', 'gcfc', 'nmfc', 'geel', 'fabio']


Do you have the same words? How do you explain it?

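They are different. The manual results are single characters (letters, digits, a space and a period), while scikit-learn's `TfidfVectorizer` returns word stems. This points to a difference in tokenization rather than a numerical one: the preprocessed headlines are joined strings, and an identity analyzer leaves `CountVectorizer` iterating over each string character by character, which is also why the BOW has 38 columns instead of the expected 4165. `TfidfVectorizer` applies its default word-level analyzer to the same strings, so its scores are computed per word. Differences in how TF and IDF are computed would shift the scores slightly; they would not turn words into characters.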