#### What is Tf-idf?

Term frequency-inverse document frequency (tf-idf) is a numerical statistic that indicates how important a word is to each document in a collection of documents, or corpus.

When applying tf-idf to a corpus, each word is given a tf-idf score for each document, representing the relevance of that word to the particular document. A higher tf-idf score indicates a term is more important to the corresponding document.
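As a rough sketch of how such a score can be computed by hand, the snippet below multiplies each term's count in a document by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for its default `smooth_idf=True` setting. The helper names `tokenize` and `tf_idf` are hypothetical, and the already-preprocessed strings are copied from the column headers of the printed output of the first cell rather than produced by `preprocess_text`.

```python
import math
from collections import Counter

# Preprocessed sample sentences, copied from the output column headers
# (assumption: this is what preprocess_text returns for the raw sentences).
docs = [
    "this be a sample sentence",
    "this be my second sentence",
    "be this my third sentence",
]

def tokenize(text):
    # Approximate TfidfVectorizer's default token pattern,
    # which only keeps tokens of two or more characters.
    return [tok for tok in text.split() if len(tok) >= 2]

def tf_idf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    # unnormalized tf-idf (comparable to norm=None): raw count times idf
    return [{term: count * idf[term] for term, count in Counter(toks).items()}
            for toks in tokenized]

scores = tf_idf(docs)
print(round(scores[0]["sample"], 6))  # 1.693147
print(round(scores[1]["my"], 6))      # 1.287682
print(round(scores[0]["this"], 6))    # 1.0
```

Terms that appear in every document ("this", "be", "sentence") get the minimum score of 1.0, while a term unique to one document ("sample") scores highest.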

In [1]:
from preprocessing import preprocess_text
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# sample documents
document_1 = "This is a sample sentence!"
document_2 = "This is my second sentence."
document_3 = "Is this my third sentence?"

# corpus of documents
corpus = [document_1, document_2, document_3]

# preprocess documents
processed_corpus = [preprocess_text(doc) for doc in corpus]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = vectorizer.fit_transform(processed_corpus)

# get vocabulary of terms
# (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()
corpus_index = list(processed_corpus)

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

          this be a sample sentence  this be my second sentence  \
be                         1.000000                    1.000000   
my                         0.000000                    1.287682   
sample                     1.693147                    0.000000   
second                     0.000000                    1.693147   
sentence                   1.000000                    1.000000   
third                      0.000000                    0.000000   
this                       1.000000                    1.000000   

          be this my third sentence  
be                         1.000000  
my                         1.287682  
sample                     0.000000  
second                     0.000000  
sentence                   1.000000  
third                      1.693147  
this                       1.000000  


In [2]:
# sample documents
document_1 = "Data Science is great!"
document_2 = "I can find insights with the skills from Data Science."
document_3 = "Machine Learning is great too?"

# corpus of documents
corpus = [document_1, document_2, document_3]

# preprocess documents
processed_corpus = [preprocess_text(doc) for doc in corpus]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = vectorizer.fit_transform(processed_corpus)

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()
corpus_index = list(processed_corpus)

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

         data science be great  \
be                    1.287682   
can                   0.000000   
data                  1.287682   
find                  0.000000   
from                  0.000000   
great                 1.287682   
insight               0.000000   
learn                 0.000000   
machine               0.000000   
science               1.287682   
skill                 0.000000   
the                   0.000000   
too                   0.000000   
with                  0.000000   

         i can find insight with the skill from data science  \
be                                                0.000000     
can                                               1.693147     
data                                              1.287682     
find                                              1.693147     
from                                              1.693147     
great                                             0.000000     
insight                                  

#### Term Frequency

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
poem = '''
Success is counted sweetest
By those who ne'er succeed.
To comprehend a nectar
Requires sorest need.

Not one of all the purple host
Who took the flag to-day
Can tell the definition,
So clear, of victory,

As he, defeated, dying,
On whose forbidden ear
The distant strains of triumph
Break, agonized and clear!'''

# define clear_count:
clear_count = 2

# preprocess text
processed_poem = preprocess_text(poem)

# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
term_frequencies = vectorizer.fit_transform([processed_poem])

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# create pandas DataFrame with term frequencies
df_term_frequencies = pd.DataFrame(term_frequencies.T.todense(), index=feature_names, columns=['Term Frequency'])
print(df_term_frequencies)

            Term Frequency
agonize                  1
all                      1
and                      1
be                       1
break                    1
by                       1
can                      1
clear                    2
comprehend               1
count                    1
day                      1
defeat                   1
definition               1
die                      1
distant                  1
ear                      1
er                       1
flag                     1
forbid                   1
he                       1
host                     1
ne                       1
nectar                   1
need                     1
not                      1
of                       3
on                       1
one                      1
purple                   1
require                  1
so                       1
sorest                   1
strain                   1
succeed                  1
success                  1
sweet                    1
t

#### Inverse Document Frequency

In [4]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from term_frequency import term_frequencies, feature_names, df_term_frequencies

# display term-document matrix of term frequencies
print(df_term_frequencies)

# initialize and fit TfidfTransformer
transformer = TfidfTransformer(norm = None)

transformer.fit(term_frequencies)
idf_values = transformer.idf_


# create pandas DataFrame with inverse document frequencies
df_idf = pd.DataFrame(idf_values, index=feature_names, columns=['Inverse Document Frequency'])
print(df_idf)

         Poem 1  Poem 2  Poem 3  Poem 4  Poem 5  Poem 6
abash         0       0       0       0       1       0
across        0       0       0       1       0       0
admire        0       0       1       0       0       0
again         0       0       0       1       0       0
agonize       1       0       0       0       0       0
...         ...     ...     ...     ...     ...     ...
word          0       0       0       0       1       0
wreck         0       0       0       1       0       0
yet           0       0       0       0       1       0
you           0       0       3       0       0       0
your          0       0       1       0       0       0

[173 rows x 6 columns]
         Inverse Document Frequency
abash                      2.252763
across                     2.252763
admire                     2.252763
again                      2.252763
agonize                    2.252763
...                             ...
word                       2.252763
wreck           

#### TF-IDF

In [5]:
from poems import poems
# preprocess documents
processed_poems = [preprocess_text(poem) for poem in poems]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm = None)
tfidf_scores = vectorizer.fit_transform(processed_poems)


# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get corpus index
corpus_index = [f"Poem {i+1}" for i in range(len(poems))]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

           Poem 1  Poem 2    Poem 3    Poem 4    Poem 5  Poem 6
abash    0.000000     0.0  0.000000  0.000000  2.252763     0.0
across   0.000000     0.0  0.000000  2.252763  0.000000     0.0
admire   0.000000     0.0  2.252763  0.000000  0.000000     0.0
again    0.000000     0.0  0.000000  2.252763  0.000000     0.0
agonize  2.252763     0.0  0.000000  0.000000  0.000000     0.0
...           ...     ...       ...       ...       ...     ...
word     0.000000     0.0  0.000000  0.000000  2.252763     0.0
wreck    0.000000     0.0  0.000000  2.252763  0.000000     0.0
yet      0.000000     0.0  0.000000  0.000000  2.252763     0.0
you      0.000000     0.0  6.758289  0.000000  0.000000     0.0
your     0.000000     0.0  2.252763  0.000000  0.000000     0.0

[173 rows x 6 columns]
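The values in the poem tables can be checked by hand. The sketch below assumes scikit-learn's documented default smoothed formula, `idf = ln((1 + n) / (1 + df)) + 1`, and no normalization (`norm=None`); the variable names are illustrative only.

```python
import math

n_poems = 6  # number of documents in the corpus

# idf for a term that appears in exactly one poem (df = 1)
idf_single = math.log((1 + n_poems) / (1 + 1)) + 1
print(round(idf_single, 6))      # 2.252763

# "you" occurs 3 times in Poem 3 and in no other poem,
# so its tf-idf score there is 3 * idf
print(round(3 * idf_single, 6))  # 6.758289
```

Both numbers match the printed tables: 2.252763 for terms with a count of 1 in a single poem, and 6.758289 for "you" in Poem 3.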


#### Converting Bag-Of-Words to TF-IDF

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from termfrequency import bow_matrix, feature_names, df_bag_of_words, corpus_index

# display term-document matrix of term frequencies (bag-of-words)
print(df_bag_of_words)

# initialize and fit TfidfTransformer, transform bag-of-words matrix
transformer = TfidfTransformer(norm = None)
tfidf_scores = transformer.fit_transform(bow_matrix)

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

         Poem 1  Poem 2  Poem 3  Poem 4  Poem 5  Poem 6
abash         0       0       0       0       1       0
across        0       0       0       1       0       0
admire        0       0       1       0       0       0
again         0       0       0       1       0       0
agonize       1       0       0       0       0       0
...         ...     ...     ...     ...     ...     ...
word          0       0       0       0       1       0
wreck         0       0       0       1       0       0
yet           0       0       0       0       1       0
you           0       0       3       0       0       0
your          0       0       1       0       0       0

[173 rows x 6 columns]
           Poem 1  Poem 2    Poem 3    Poem 4    Poem 5  Poem 6
abash    0.000000     0.0  0.000000  0.000000  2.252763     0.0
across   0.000000     0.0  0.000000  2.252763  0.000000     0.0
admire   0.000000     0.0  2.252763  0.000000  0.000000     0.0
again    0.000000     0.0  0.000000  2.252763  0

In [7]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from raven import the_raven_stanzas
from preprocessing import preprocess_text

# view first stanza
print(the_raven_stanzas[0])


# preprocess documents
processed_stanzas = [preprocess_text(stanza) for stanza in the_raven_stanzas]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm = None)
tfidf_scores = vectorizer.fit_transform(processed_stanzas)

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get stanza index
stanza_index = [f"Stanza {i+1}" for i in range(len(the_raven_stanzas))]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=stanza_index)
print(df_tf_idf)


Once upon a midnight dreary, while I pondered, weak and weary,
 Over many a quaint and curious volume of forgotten lore,
 While I nodded, nearly napping, suddenly there came a tapping,
 As of some one gently rapping, rapping at my chamber door
        Stanza 1  Stanza 2  Stanza 3  Stanza 4  Stanza 5   Stanza 6  Stanza 7  \
above        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
adore        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
again        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
agree        0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
ah           0.0       0.0  3.079442       0.0       0.0   0.000000       0.0   
...          ...       ...       ...       ...       ...        ...       ...   
wretch       0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
yet          0.0       0.0  0.000000       0.0       0.0   0.000000       0.0   
yore         0.0       0.0

#### Mini Project


##### Read the News Analysis
Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.

Given the vast number of news articles in circulation, identifying and organizing articles by topic is a useful activity. It can help you sift through the enormous amount of information out there to find the news relevant to your interests, or even allow you to build a news recommendation engine!

The News International is the largest English-language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a Kaggle dataset of The News International articles is provided in the workspace.

In this project, you will use term frequency-inverse document frequency (tf-idf) to analyze each article’s content and uncover the terms that best describe it, providing quick insight into each article’s topic.
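One way to picture "the terms that best describe each article" is to rank an article's terms by score and keep the top few. The following is a minimal sketch, not the project solution: `top_terms` is a hypothetical helper, and the scores are made-up illustrative values rather than results from the dataset.

```python
import heapq

def top_terms(scores, n=3):
    """Return the n highest-scoring (term, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

# Hypothetical tf-idf scores for two articles (illustrative values only).
article_scores = {
    "Article 1": {"fare": 9.2, "transport": 6.1, "the": 3.0, "karachi": 7.4},
    "Article 2": {"oil": 8.8, "price": 5.5, "the": 4.0},
}

for name, scores in article_scores.items():
    print(name, [term for term, _ in top_terms(scores, n=2)])
```

Common words such as "the" tend to fall to the bottom of the ranking because they appear in most documents and so receive a low inverse document frequency.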

In [8]:
import pandas as pd
import numpy as np
from articles import articles
from preprocessing import preprocess_text

In [9]:
# import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# view article
print(articles[0])

# preprocess articles
processed_articles = [preprocess_text(article) for article in articles]
print(processed_articles)



# initialize and fit CountVectorizer
# (the corpus is passed to fit_transform, not to the constructor)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# convert counts to tf-idf
transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)


# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm = None)
tfidf_scores = vectorizer.fit_transform(processed_articles)





# check if tf-idf scores are equal
if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
  print(pd.DataFrame({'Are the tf-idf scores the same?':['YES']}))
else:
  print(pd.DataFrame({'Are the tf-idf scores the same?':['No, something is wrong :(']}))






# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get article index
article_index = [f"Article {i+1}" for i in range(len(articles))]

# create pandas DataFrame with word counts
df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
print(df_word_counts)

# create pandas DataFrame(s) with tf-idf scores
# first from the CountVectorizer + TfidfTransformer route
df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
print(df_tf_idf)

# then from the TfidfVectorizer route (overwrites the previous DataFrame)
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
print(df_tf_idf)

# get highest scoring tf-idf term for each article
# (selecting a single column returns a Series, so idxmax() yields the term itself)
terms = [df_tf_idf[f'Article {i+1}'].idxmax() for i in range(len(articles))]

print(terms)

KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling. Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.
['karachi the sindh government have decide to bring down public transport fare by 7 per cent due to massive reduction in petroleum product price by the federal government geo news report source say reduction in fare will be applicable on public transport rickshaw taxi and other mean of travel meanwhile karachi trans



