# 01-TF-IDF

In this notebook, we compute TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [34]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file *headlines.csv*.

In [35]:
# TODO: Load the dataset
df = pd.read_csv("headlines.csv")

As usual, check the dataset's basic information.

In [36]:
df.head()

Unnamed: 0,publish_date,headline_text
0,20170721,algorithms can make decisions on behalf of fed...
1,20170721,andrew forrests fmg to appeal pilbara native t...
2,20170721,a rural mural in thallan
3,20170721,australia church risks becoming haven for abusers
4,20170721,australian company usgfx embroiled in shanghai...


In [37]:
# TODO: Have a look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [38]:
df.dtypes

publish_date      int64
headline_text    object
dtype: object

We will now preprocess this text data: tokenization, removal of punctuation and stop words, and stemming.

Hint: use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.

In [39]:
# TODO: Perform preprocessing
#headline_df = df["headline_text"]

# import needed modules
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('punkt') 
nltk.download('stopwords') 
nltk.download('wordnet') 
nltk.download('averaged_perceptron_tagger') 
nltk.download('omw-1.4')

# Tokenize
df['preprocessed_hline'] = df["headline_text"].apply(word_tokenize)

# Remove punctuation
df["preprocessed_hline"] = df['preprocessed_hline'].apply(lambda tokens: [word for word in tokens if word.isalpha()])

# Remove stop words (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
df['preprocessed_hline'] = df["preprocessed_hline"].apply(lambda tokens: [word for word in tokens if word not in stop_words])

# Stem
stemmer = PorterStemmer()
df["preprocessed_hline"] = df['preprocessed_hline'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])

df["preprocessed_hline"]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\61406\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\61406\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\61406\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\61406\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\61406\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


0         [algorithm, make, decis, behalf, feder, minist]
1       [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                                 [rural, mural, thallan]
3                  [australia, church, risk, becom, abus]
4       [australian, compani, usgfx, embroil, shanghai...
                              ...                        
1994    [constitut, avenu, win, top, prize, act, archi...
1995                         [dark, mofo, number, crunch]
1996    [david, petraeu, say, australia, must, firm, s...
1997    [driverless, car, australia, face, challeng, r...
1998               [drug, compani, criticis, price, hike]
Name: preprocessed_hline, Length: 1999, dtype: object
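
To see what each step does, the pipeline can be traced on a single headline. The sketch below is plain Python and only approximates the NLTK version: `preprocess` and `TOY_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the stemmer is an optional callable that defaults to the identity.

```python
# Toy stop-word list for illustration only; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"a", "in", "on", "of", "can", "for", "to"}

def preprocess(headline, stop_words, stem=lambda word: word):
    """Tokenize, keep alphabetic tokens, drop stop words, then stem."""
    tokens = headline.split()                              # stand-in for word_tokenize
    tokens = [t for t in tokens if t.isalpha()]            # drops punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]    # drops stop words
    return [stem(t) for t in tokens]                       # identity unless a stemmer is given

print(preprocess("a rural mural in thallan", TOY_STOP_WORDS))
# ['rural', 'mural', 'thallan']
```

Passing `stem=PorterStemmer().stem` would plug the NLTK stemmer into the same function.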

In [48]:
df.head()

Unnamed: 0,publish_date,headline_text,preprocessed_hline
0,20170721,algorithms can make decisions on behalf of fed...,"[algorithm, make, decis, behalf, feder, minist]"
1,20170721,andrew forrests fmg to appeal pilbara native t...,"[andrew, forrest, fmg, appeal, pilbara, nativ,..."
2,20170721,a rural mural in thallan,"[rural, mural, thallan]"
3,20170721,australia church risks becoming haven for abusers,"[australia, church, risk, becom, abus]"
4,20170721,australian company usgfx embroiled in shanghai...,"[australian, compani, usgfx, embroil, shanghai..."


Now compute the Bag of Words for our data using scikit-learn.

Warning: since we did our own preprocessing, you have to bypass the vectorizer's analyzer with an identity function.

In [40]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer

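# A callable analyzer bypasses the built-in preprocessing, tokenization and
# stop-word filtering, so stop_words and lowercase would have no effect here
count_vec = CountVectorizer(analyzer=lambda x: x)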
BOW = count_vec.fit_transform(df["preprocessed_hline"]).toarray()

BOW


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [41]:
print(BOW.shape)

(1999, 4165)


You can check the shape of the BOW; the expected value is `(1999, 4165)`.
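
As an illustration of what the vectorizer produces, here is a minimal plain-Python sketch of a Bag of Words built from token lists. The two documents are toy data in the same format as `df["preprocessed_hline"]`, and the names `docs`, `vocab` and `bow` are assumptions made for this example.

```python
from collections import Counter

# Two already-preprocessed toy headlines (lists of tokens)
docs = [
    ["rural", "mural", "thallan"],
    ["drug", "compani", "price", "drug"],   # "drug" is repeated on purpose
]

# Vocabulary: every distinct token, sorted so that the columns have a fixed order
vocab = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary word
bow = []
for doc in docs:
    counts = Counter(doc)                    # missing words count as 0
    bow.append([counts[word] for word in vocab])

print(vocab)  # ['compani', 'drug', 'mural', 'price', 'rural', 'thallan']
print(bow)    # [[0, 0, 1, 0, 1, 1], [1, 2, 0, 1, 0, 0]]
```

The shape is (number of documents, vocabulary size), which is why the notebook's matrix has 1999 rows and 4165 columns.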

Now compute the Term Frequency and then the Inverse Document Frequency, and check that the values are not all zeros.

In [42]:
# TODO: Compute the TF using the BOW
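# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
BOW_df = pd.DataFrame(data=BOW, columns=count_vec.get_feature_names_out())
# TF: each count divided by the number of tokens in its headline (the row sum)
term_freq = BOW_df.divide(BOW_df.sum(axis=1), axis=0)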

np.unique(term_freq)



array([0.        , 0.08333333, 0.09090909, 0.1       , 0.11111111,
       0.125     , 0.14285714, 0.16666667, 0.18181818, 0.2       ,
       0.22222222, 0.25      , 0.28571429, 0.33333333, 0.4       ,
       0.5       , 1.        ])
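
The values above are simple fractions because each count is divided by the length of its headline. A minimal sketch of that row normalisation, with a hypothetical helper `term_frequencies` operating on one row of counts:

```python
def term_frequencies(counts):
    """Divide each count by the row total, i.e. the number of tokens in the document."""
    total = sum(counts)          # assumes the document is not empty
    return [count / total for count in counts]

tf = term_frequencies([1, 2, 0, 1])   # a 4-token document with one repeated word
print(tf)  # [0.25, 0.5, 0.0, 0.25]
```

A value of 1.0 in the output therefore corresponds to a headline that was reduced to a single token by the preprocessing.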

In [43]:
# TODO: Compute the IDF
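# Document frequency: number of headlines that contain each word at least once
# (computed on a boolean mask so that BOW_df itself is left unchanged)
doc_freq = (BOW_df > 0).sum(axis=0)

# Smoothed variant that avoids division by zero:
# IDF = np.log((1 + len(BOW_df)) / (1 + doc_freq))
IDF = np.log(len(BOW_df) / doc_freq)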

np.unique(IDF)

array([3.28291422, 3.36629583, 3.44151925, 3.53995932, 3.57505064,
       3.70858204, 3.79373984, 3.83920222, 3.91152288, 3.96281617,
       4.04505427, 4.10389477, 4.13466643, 4.16641513, 4.19920495,
       4.2331065 , 4.26819782, 4.30456547, 4.3423058 , 4.38152651,
       4.4223485 , 4.46490812, 4.50935988, 4.5558799 , 4.60467006,
       4.65596336, 4.71003058, 4.76718899, 4.82781361, 4.89235213,
       4.961345  , 5.03545298, 5.11549568, 5.20250706, 5.29781724,
       5.40317776, 5.52096079, 5.65449219, 5.80864287, 5.99096442,
       6.21410797, 6.50179005, 6.90725515, 7.60040233])
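
The two largest values in this array can be checked by hand with the unsmoothed formula log(N / df), where N = 1999 is the number of headlines. In the sketch below, `inverse_document_frequency` is a hypothetical helper name.

```python
import math

def inverse_document_frequency(n_docs, doc_freq):
    """Unsmoothed IDF: natural log of (number of documents / document frequency)."""
    return math.log(n_docs / doc_freq)

# A word that appears in exactly one of the 1999 headlines
print(round(inverse_document_frequency(1999, 1), 6))  # 7.600402
# A word that appears in exactly two headlines
print(round(inverse_document_frequency(1999, 2), 6))  # 6.907255
```

The rarer the word, the higher its IDF, so the maximum of the array is reached by words that occur in a single headline.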

Finally, compute the TF-IDF.

In [44]:
# TODO: compute the TF-IDF
TF_IDF = term_freq * IDF.values
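
The multiplication above relies on broadcasting: every row of term frequencies is multiplied element-wise by the same vector of IDF values. For a single row, this amounts to the following sketch (toy numbers, hypothetical variable names):

```python
tf_row = [0.5, 0.5, 0.0]    # term frequencies of one document
idf = [2.0, 1.0, 3.0]       # one IDF value per vocabulary word

# Element-wise product: a word absent from the document keeps a score of 0
tf_idf_row = [tf * weight for tf, weight in zip(tf_row, idf)]
print(tf_idf_row)  # [1.0, 0.5, 0.0]
```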

What are the 10 words with the highest and lowest TF-IDF on average?

In [45]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
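# Note: max(axis=0) ranks each word by its highest TF-IDF in any headline;
# TF_IDF.mean(axis=0) would rank by the average over all headlines instead
tf_idf_avg = TF_IDF.max(axis=0).sort_values()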
print("Lowest TF-IDF on Average:\n", tf_idf_avg[:10])
print("\nHighest TF-IDF on Average:\n", tf_idf_avg[-10:])

Lowest TF-IDF on Average:
 gcfc    0.633367
geel    0.633367
gw      0.633367
haw     0.633367
melb    0.633367
coll    0.633367
adel    0.633367
syd     0.633367
nmfc    0.633367
cold    0.690456
dtype: float64

Highest TF-IDF on Average:
 date         3.800201
mongolian    3.800201
puffbal      3.800201
mous         3.800201
rig          3.800201
superannu    3.800201
aquapon      3.800201
loophol      3.800201
pump         6.907255
peacemak     7.600402
dtype: float64
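
Ranking words by their maximum score and ranking them by their mean score can give different answers. The toy matrix below (hypothetical values and names) contrasts a word with one high score against a word with moderate scores everywhere.

```python
# Toy TF-IDF matrix: rows are documents, columns follow the order of `vocab`
vocab = ["rare", "common"]
tf_idf = [
    [0.5, 0.2],
    [0.0, 0.2],
    [0.0, 0.2],
    [0.0, 0.2],
]
n_docs = len(tf_idf)

# Per-word maximum and per-word mean over all documents
col_max = {w: max(row[j] for row in tf_idf) for j, w in enumerate(vocab)}
col_mean = {w: sum(row[j] for row in tf_idf) / n_docs for j, w in enumerate(vocab)}

print(max(col_max, key=col_max.get))    # rare
print(max(col_mean, key=col_mean.get))  # common
```

With the maximum, a word that dominates a single short headline comes out on top; with the mean, words that appear in many headlines are favoured.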


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [46]:
# TODO: Compute the TF-IDF using scikit learn
# Import the module
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the TF-IDF vectorizer
TFIDF_vec = TfidfVectorizer(lowercase = False, analyzer = lambda x: x)

# Compute the TF-IDF
TF_IDF_sk = TFIDF_vec.fit_transform(df["preprocessed_hline"]).toarray()
TF_IDF_sk

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Compare the 10 highest and lowest TF-IDF words on average with the ones you computed yourself.

In [47]:
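# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
TF_IDF_skdf = pd.DataFrame(data=TF_IDF_sk, columns=TFIDF_vec.get_feature_names_out())
# As in the manual version, max(axis=0) ranks by the highest score, not the mean
tf_idf_sk_avg = TF_IDF_skdf.max(axis=0).sort_values()

print("Lowest TF-IDF (sk-learn method) on Average:\n", tf_idf_sk_avg[:10])
print("\nHighest TF-IDF (sk-learn method) on Average:\n", tf_idf_sk_avg[-10:])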

Lowest TF-IDF (sk-learn method) on Average:
 coll     0.305258
gw       0.305258
nmfc     0.305258
adel     0.305258
melb     0.305258
syd      0.305258
haw      0.305258
geel     0.305258
gcfc     0.305258
fabio    0.322574
dtype: float64

Highest TF-IDF (sk-learn method) on Average:
 mosul        0.779137
rig          0.786813
travel       0.788050
aquapon      0.794899
date         0.794899
employ       0.795060
financ       0.803629
mongolian    0.831769
pump         1.000000
peacemak     1.000000
dtype: float64




Do you have the same words? How do you explain it?

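Not exactly. The two lists overlap (nine of the ten lowest words and six of the ten highest, including *pump* and *peacemak*, appear in both), but the scores and part of the ranking differ. This comes from differences in the formula and in the normalisation. The manual calculation (method 1) uses the textbook definition: the term frequency is the count divided by the headline length, and the IDF is log(N / df), where df is obtained by clipping every count above 1 down to 1 and summing per word. With its default settings, scikit-learn instead uses the raw count as the term frequency, smooths the IDF to log((1 + N) / (1 + df)) + 1, and then rescales each document vector to unit L2 norm, which caps every score at 1. Sublinear scaling, which replaces TF with 1 + log(TF), is available through `sublinear_tf=True` but is off by default, so it does not contribute here. Note also that both cells rank words by their maximum score rather than their mean, and that words with tied scores may appear in any order.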
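
The scikit-learn scores can be approximated by hand. The sketch below is plain Python and assumes the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the function name `sklearn_style_tfidf` and the toy document frequencies are hypothetical.

```python
import math

def sklearn_style_tfidf(counts, doc_freq, n_docs):
    """TF-IDF of one document: raw count * smoothed IDF, then L2 normalisation."""
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# A one-word headline scores 1.0 after L2 normalisation, whatever its IDF
one_word = sklearn_style_tfidf({"peacemak": 1}, {"peacemak": 1}, n_docs=1999)
print(round(one_word["peacemak"], 6))  # 1.0

# In a longer headline the unit length is shared between the words
two_words = sklearn_style_tfidf({"dark": 1, "mofo": 1}, {"dark": 5, "mofo": 1}, n_docs=1999)
print(two_words["mofo"] > two_words["dark"])  # True: the rarer word gets the larger share
```

This is consistent with *pump* and *peacemak* both reaching exactly 1.0 in the scikit-learn output while keeping different scores in the manual calculation.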