### Term Frequency - Inverse Document Frequency

Term frequency-inverse document frequency, commonly known as tf-idf, is another powerful tool in your NLP toolkit. It has a variety of use cases, including:
- ranking results in a search engine
- text summarization
- building smarter chatbots

The output of applying tf-idf is a table, also known as a term-document matrix. You can think of a term-document matrix as a matrix of bag-of-words vectors.
Each column of the table represents a unique document (in this case, an individual sentence). Each row represents a unique word token. The value in each cell represents the tf-idf score for a word token in that particular document.

Term frequency-inverse document frequency is a numerical statistic used to indicate how important a word is to each document in a collection of documents, or a corpus.

When applying tf-idf to a corpus, each word is given a tf-idf score for each document, representing the relevance of that word to the particular document. A higher tf-idf score indicates a term is more important to the corresponding document.

Tf-idf has many similarities with the bag-of-words language model, which, if you recall, is concerned with word count: how many times each word appears in a document.

While tf-idf can be used in any situation bag-of-words can be used, there is a key difference in how it is calculated.

Tf-idf relies on two different metrics in order to come up with an overall score:
- term frequency, or how often a word appears in a document. This is the same as bag-of-words’ word count.
- inverse document frequency, which is based on how many documents in the overall corpus contain a word. The more documents a word appears in, the lower its inverse document frequency. By penalizing the score of words that appear throughout a corpus, tf-idf can give better insight into how important a word is to a particular document of a corpus.
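
To make the two metrics concrete, here is a minimal sketch that computes tf-idf by hand in plain Python. The toy corpus and the helper names `idf` and `tf_idf` are made up for illustration. The idf formula is the smoothed variant that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`; other libraries and textbooks may use slightly different definitions.

```python
import math
from collections import Counter

# toy corpus (hypothetical), already lowercased and split on whitespace
docs = [
    "this is a sample sentence".split(),
    "this is my second sentence".split(),
    "is this my third sentence".split(),
]

n_docs = len(docs)

# document frequency: the number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # smoothed inverse document frequency (scikit-learn's default variant)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tf_idf(term, doc):
    # raw count of the term in the document times its corpus-level idf
    return doc.count(term) * idf(term)

print(tf_idf("sentence", docs[0]))  # in every document, so the score is 1.0
print(tf_idf("sample", docs[0]))    # only in the first document, about 1.69
```

Both words appear once in the first document, so their term frequency is the same. Only the inverse document frequency separates them: "sample" scores higher because it is rarer across the corpus.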

In [1]:
import nltk, re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

def preprocess_text(text):
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized])
  return normalized

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# sample documents
document_1 = "This is a sample sentence!"
document_2 = "This is my second sentence."
document_3 = "Is this my third sentence?"

# corpus of documents
corpus = [document_1, document_2, document_3]

# preprocess documents
processed_corpus = [preprocess_text(doc) for doc in corpus]

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = vectorizer.fit_transform(processed_corpus)

# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()
corpus_index = [n for n in processed_corpus]

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tf_idf_scores.T.todense(), index=feature_names, columns=corpus_index)
print(df_tf_idf)

### Inverse Document Frequency
The inverse document frequency component of the tf-idf score penalizes terms that appear more frequently across a corpus. The intuition is that words that appear more frequently in the corpus give less insight into the topic or meaning of an individual document, and should thus be deprioritized.

For example, terms like “the” or “go” are used all over the place, so in a bag-of-words model, they would be given priority even though they don’t provide much meaning; tf-idf would deprioritize these sorts of common words.

Inverse document frequency can be calculated on a group of documents using scikit-learn’s TfidfTransformer:
```
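from sklearn.feature_extraction.text import TfidfTransformer

# term_frequencies: a bag-of-words count matrix, one row per document
transformer = TfidfTransformer(norm=None)
transformer.fit(term_frequencies)
inverse_doc_frequency = transformer.idf_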
```
- a TfidfTransformer object is initialized. Don’t worry about the norm=None keyword argument for now; we will dig into it in the next exercise
- the TfidfTransformer is fit (trained) on a matrix of term frequencies, with one row per document and one column per term
- the .idf_ attribute of the TfidfTransformer stores the inverse document frequencies of the terms as a NumPy array


We can easily calculate the tf-idf values for each term-document pair in our corpus using scikit-learn’s TfidfVectorizer:
```
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus)
```
- a TfidfVectorizer object is initialized. The norm=None keyword argument turns off the normalization that scikit-learn applies by default (scaling each document's vector to unit length), so each score is exactly the term frequency multiplied by the inverse document frequency
- the TfidfVectorizer object is fit and transformed on the corpus of data, returning the tf-idf scores for each term-document pair
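
To see what `norm=None` changes, this sketch compares the unnormalized scores with the scores produced by scikit-learn's default setting, `norm='l2'`. The corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (hypothetical)
corpus = [
    "this is a sample sentence",
    "this is my second sentence",
    "is this my third sentence",
]

# rows are documents, columns are terms
raw_scores = TfidfVectorizer(norm=None).fit_transform(corpus).toarray()
l2_scores = TfidfVectorizer().fit_transform(corpus).toarray()  # default norm='l2'

# with the default, every document vector has Euclidean length 1
print(np.linalg.norm(l2_scores, axis=1))

# dividing the raw scores by their row lengths reproduces the default output
row_lengths = np.linalg.norm(raw_scores, axis=1, keepdims=True)
print(np.allclose(raw_scores / row_lengths, l2_scores))
```

Normalization is useful when documents differ a lot in length. Leaving it off keeps the scores easy to verify by hand.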



### Converting Bag-of-Words to Tf-idf
In addition to directly calculating the tf-idf scores for a set of terms across a corpus, you can also convert a bag-of-words model you have already created into tf-idf scores.

Scikit-learn’s TfidfTransformer is up to the task of converting your bag-of-words model to tf-idf. You begin by initializing a TfidfTransformer object.

`tfidf_transformer = TfidfTransformer(norm=None)`

Given a bag-of-words matrix `count_matrix`, you can now multiply the term frequencies by their inverse document frequency to get the tf-idf scores as follows:

`tf_idf_scores = tfidf_transformer.fit_transform(count_matrix)`

This is very similar to how we calculated inverse document frequency, except this time we are fitting and transforming the TfidfTransformer to the term frequencies/bag-of-words vectors rather than just fitting the TfidfTransformer to them.
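
The conversion can be sketched end to end on a small corpus. Here `count_matrix` comes from a `CountVectorizer`, and the last lines check that, with `norm=None`, each score is the word count multiplied by that term's idf. The corpus and the name `by_hand` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy corpus (hypothetical)
corpus = [
    "this is a sample sentence",
    "this is my second sentence",
    "is this my third sentence",
]

# bag-of-words counts: one row per document, one column per term
count_matrix = CountVectorizer().fit_transform(corpus)

# fit learns the idf values, transform applies them to the counts
tfidf_transformer = TfidfTransformer(norm=None)
tf_idf_scores = tfidf_transformer.fit_transform(count_matrix)

# with norm=None, each score is count * idf for that term
by_hand = count_matrix.toarray() * tfidf_transformer.idf_
print(np.allclose(tf_idf_scores.toarray(), by_hand))
```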


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import import_ipynb
from the_raven import the_raven_stanzas

# view first stanza
print(the_raven_stanzas[0])

# preprocess documents
processed_stanzas = [preprocess_text(stanza) for stanza in the_raven_stanzas]

# initialize TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)

# fit and transform the processed stanzas into tf-idf scores
tfidf_scores = vectorizer.fit_transform(processed_stanzas)

# get stanza index and vocabulary of terms
stanza_index = [f"Stanza {i+1}" for i in range(len(the_raven_stanzas))]
feature_names = vectorizer.get_feature_names_out()

# create pandas DataFrame with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=stanza_index)
print(df_tf_idf)

### Working with Text Data | scikit-learn | From Occurrences to Frequencies
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#from-occurrences-to-frequencies

In [4]:
from preprocessing import preprocess_text
import pandas as pd
import numpy as np
import import_ipynb
from articles import articles

# import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# preprocess articles
processed_articles = [preprocess_text(article) for article in articles]
print(processed_articles[2])
# initialize and fit CountVectorizer
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# convert counts to tf-idf
transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)

# initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)


# check if tf-idf scores are equal
if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
  print(pd.DataFrame({'Are the tf-idf scores the same?':['YES']}))
else:
  print(pd.DataFrame({'Are the tf-idf scores the same?':['No, something is wrong :(']}))


# get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

# get article index
article_index = [f"Article {i+1}" for i in range(len(articles))]

# create pandas DataFrame with word counts
df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
print(df_word_counts)

# create pandas DataFrame with tf-idf scores from the TfidfTransformer
df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
print(df_tf_idf)

# create pandas DataFrame with tf-idf scores from the TfidfVectorizer
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
print(df_tf_idf)

# get highest scoring tf-idf term for each article
for i in range(1, len(articles) + 1):
  print(df_tf_idf[[f'Article {i}']].idxmax())


karachi wholesale market rat for sugar drop to less than r 50 per kg follow the resumption of sugar cane crush by sugar mill in sindh within two day the rate drop by r 1 70 to r 49 80 per kg in karachi whole sale market accord to dealer the resumption of sugar cane crush by the mill stabilise the supply to the market with an immediate effect on price a well industry expert say that the quality of sugar cane be excellent in sindh and approximately 100 kg of sugar cane can produce 11 kg of sugar
  Are the tf-idf scores the same?
0                             YES
       Article 1  Article 2  Article 3  Article 4  Article 5  Article 6  \
100            0          0          1          0          0          0   
11             0          0          1          0          0          0   
15             0          0          0          0          1          0   
158            0          1          0          0          0          0   
19             0          1          0          0         

The pandas method .idxmax() returns the index label of the highest value in a Series, such as a single DataFrame column. Because each row of df_tf_idf is labeled with a term, we will use this method to find the highest scoring tf-idf term for each article.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.idxmax.html

Within the for loop, paste the following code:

`print(df_tf_idf[[f'Article {i}']].idxmax())`

On each pass through the for loop, this code will print the index of the term with the highest tf-idf score for that article (from Article 1 to Article 10).
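
As a small illustration of how `.idxmax()` behaves, the sketch below uses a made-up score table. The DataFrame `df_scores` and its values are hypothetical and not taken from the articles.

```python
import pandas as pd

# hypothetical tf-idf scores: rows are terms, columns are documents
df_scores = pd.DataFrame(
    {"Article 1": [0.0, 2.4, 1.0], "Article 2": [3.1, 0.0, 1.0]},
    index=["market", "sugar", "the"],
)

# Series.idxmax(): the row label of the largest value in one column
print(df_scores["Article 1"].idxmax())  # sugar

# DataFrame.idxmax(): one row label per column, returned as a Series
print(df_scores.idxmax())
```

Selecting a column with double brackets, as in `df_tf_idf[[f'Article {i}']]`, returns a one-column DataFrame, so `.idxmax()` returns a one-element Series rather than a bare label.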