# So, What is TF-IDF?
*(if you're not interested in the theory part of TF-IDF, feel free to skip to the code section.)*
![](https://monkeylearn.com/static/679ad6824cd3f362d6081c38b8ef5824/35d2d/What-is-TF-IDF-Normal.png)

### ***Okay, theory time!***

**TF-IDF** is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).

TF-IDF (**term frequency-inverse document frequency**) was invented for document search and information retrieval. The score increases in proportion to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as *this*, *what*, and *if*, rank low even though they may appear many times, since they don’t mean much to that document in particular.

However, if the word *Bug* appears many times in one document while rarely appearing in the others, it is probably very relevant to that document. For example, if we’re trying to find out which topics some NPS (Net Promoter Score) responses belong to, the word *Bug* would probably end up tied to the topic *Reliability*, since most responses containing that word would be about that topic.

### ***Math behind TF-IDF...***
**TF-IDF** for a word in a document is calculated by multiplying two different metrics:

- The **term frequency** of a word in a document. There are several ways of calculating this frequency, the simplest being a raw count of how many times the word appears in the document. The count can then be adjusted, for example by the length of the document or by the raw frequency of the most frequent word in the document.
- The **inverse document frequency** of the word across a set of documents. This measures how common or rare a word is in the entire document set: the closer it is to 0, the more common the word. It is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the result.

So, if the word is very common and appears in many documents, this number will approach 0. Otherwise it grows, reaching log(N) for a word that appears in only one of N documents.
Multiplying these two numbers results in the **TF-IDF score** of a word in a document. The higher the score, the more relevant that word is in that particular document.
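
To make the arithmetic concrete, here is a minimal sketch that scores words by hand. It assumes raw counts for term frequency and log(N / df) for inverse document frequency, which is only one of several variants; the toy documents and the helper name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, for illustration only)
docs = [
    "the app has a bug",
    "the bug crashes the app",
    "the price is fair",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Raw-count TF times log(N / df) IDF for one word in one document."""
    tf = tokens.count(word)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

# "the" occurs in every document, so its IDF is log(3 / 3) = 0
print(tf_idf("the", tokenized[1]))
# "crashes" occurs in a single document, so its score is 1 * log(3)
print(tf_idf("crashes", tokenized[1]))
```

Even though "the" appears twice in the second document, its score is zero, while the rarer "crashes" gets the highest score in that document.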

To put it in more formal mathematical terms, the TF-IDF score for the word `t` in the document `d` from the document set `D` is calculated as follows:

![](https://monkeylearn.com/static/23b5e36265d19e9b42a9ae42220d257b/df264/1.png)

Where:

![](https://monkeylearn.com/static/d96cda57105351e7b75b844910ab3f73/df264/2.png)

![](https://monkeylearn.com/static/aaa7bf8149587b9b828f99e1db9f7e46/df264/3.png)

*Source:* [*https://monkeylearn.com/blog/what-is-tf-idf/*](https://monkeylearn.com/blog/what-is-tf-idf/)
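
In case the images above don't load, a common textbook form of these definitions is written out below. Treat it as an assumption about what the figures express rather than a transcription: variants differ in how `tf` is scaled (some use a logarithm of the count) and in whether `idf` is smoothed.

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

\mathrm{tf}(t, d) = f_{t,d} \quad \text{(raw count of } t \text{ in } d \text{)}

\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert},
\qquad N = \lvert D \rvert
```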

**Without further ado, let's dive into the practical use-case of TF-IDF.**

**We'll be extracting the top 10 keywords from the published papers in [NeurIPS Conferences](https://www.kaggle.com/rowhitswami/nips-papers-1987-2019-updated/) from 1987 to 2019 using the built-in TF-IDF implementation in the Scikit-learn library.**

# Code

In [30]:
# Printing data files
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stopwords/stopwords.txt
/kaggle/input/nips-papers-1987-2019-updated/papers.csv
/kaggle/input/nips-papers-1987-2019-updated/authors.csv
/kaggle/input/chevy-data/ChevroletRecallsIssues.csv


In [31]:
# General libraries
import re, os, string
import pandas as pd

# Scikit-learn imports
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
def get_stopwords_list(stop_file_path):
    """Load stop words from a text file, one word per line."""
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        # Strip whitespace, skip blank lines, and de-duplicate via a set
        stop_set = {line.strip() for line in f if line.strip()}
    return list(stop_set)

In [33]:
def clean_text(text):
    """Doc cleaning"""
    
    # Lowering text
    text = text.lower()
    
    # Removing punctuation
    text = "".join([c for c in text if c not in PUNCTUATION])
    
    # Collapsing runs of whitespace (including newlines) into a single space
    # (raw string avoids an invalid-escape warning for \s)
    text = re.sub(r'\s+', ' ', text)
    
    return text
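
To see what this cleaning does in practice, here is a small self-contained sketch. It repeats the same three steps under a hypothetical name, `clean_text_demo`, uses `string.punctuation` as the punctuation set, and runs on a made-up sentence.

```python
import re
import string

def clean_text_demo(text):
    """Same steps as clean_text: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    return re.sub(r"\s+", " ", text)

# Made-up sample, not taken from the dataset
sample = "Certain light-duty trucks:\n  the driver's seat belt may FAIL."
print(clean_text_demo(sample))
# certain lightduty trucks the drivers seat belt may fail
```

Note that deleting punctuation outright fuses hyphenated words ("light-duty" becomes "lightduty"). If that matters for your keywords, replacing punctuation with a space instead is one alternative.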

In [34]:
def sort_coo(coo_matrix):
    """Return (column index, score) pairs sorted by score, highest first"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # build a dict mapping each feature name to its score (highest first)
    results = dict(zip(feature_vals, score_vals))
    
    return results
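
Here is a quick sketch of what these two helpers do together, on a hand-built sparse row. It assumes SciPy is available (it is installed alongside scikit-learn); the scores and the five-word vocabulary are made up.

```python
from scipy.sparse import coo_matrix

# Hypothetical 1 x 5 TF-IDF row: columns 0, 2 and 4 hold non-zero scores
row = coo_matrix(([0.2, 0.7, 0.5], ([0, 0, 0], [0, 2, 4])), shape=(1, 5))
feature_names = ["airbag", "belt", "brake", "fuel", "seat"]  # made-up vocabulary

# Pair each column index with its score, then sort by score
# (ties broken by column index), highest first
pairs = sorted(zip(row.col, row.data), key=lambda x: (x[1], x[0]), reverse=True)

# Keep the top 2 and map column indices back to words
top2 = {feature_names[idx]: round(float(score), 3) for idx, score in pairs[:2]}
print(top2)
# {'brake': 0.7, 'seat': 0.5}
```

Only the non-zero entries are stored in COO format, so words that don't occur in the document never enter the sort.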

In [35]:
def get_keywords(vectorizer, feature_names, doc):
    """Return top k keywords from a doc using TF-IDF method"""

    #generate tf-idf for the given document
    tf_idf_vector = vectorizer.transform([doc])
    
    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only TOP_K_KEYWORDS
    keywords=extract_topn_from_vector(feature_names,sorted_items,TOP_K_KEYWORDS)
    
    return list(keywords.keys())

In [36]:
# Constants
PUNCTUATION = string.punctuation  # same characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
TOP_K_KEYWORDS = 10 # top k number of keywords to retrieve in a ranked document
STOPWORD_PATH = "/kaggle/input/stopwords/stopwords.txt"

# Reading data

In [37]:
data = pd.read_csv("/kaggle/input/chevy-data/ChevroletRecallsIssues.csv")
data.head()

| Unnamed: 0 | title | summary | consequence | correction |
|---|---|---|---|---|
| 0 | Component not known | Passenger vans equipped with a brake warning i... | | Dealers will inspect for three screw clamps to... |
| 1 | Component not known | On these passenger vehicles, the lap belt webb... | | Dealers will install inserts to the belt web g... |
| 2 | Component not known | On certain passenger vehicles equipped with re... | | Dealers will replace the rear spindle bolts. T... |
| 3 | Component not known | Certain passenger vehicles fail to comply with... | | Dealers will inspect the fuel hose fill neck c... |
| 4 | Component not known | Certain passenger vehicles, light duty pickup ... | | Dealers will inspect the vehicle's rear safety... |


In [38]:
data.dropna(subset=['summary'], inplace=True)

# Preparing data

In [39]:
data['cleanReviews'] = data['summary'].apply(clean_text)

In [40]:
data.head()

| Unnamed: 0 | title | summary | consequence | correction | cleanReviews |
|---|---|---|---|---|---|
| 0 | Component not known | Passenger vans equipped with a brake warning i... | | Dealers will inspect for three screw clamps to... | passenger vans equipped with a brake warning i... |
| 1 | Component not known | On these passenger vehicles, the lap belt webb... | | Dealers will install inserts to the belt web g... | on these passenger vehicles the lap belt webbi... |
| 2 | Component not known | On certain passenger vehicles equipped with re... | | Dealers will replace the rear spindle bolts. T... | on certain passenger vehicles equipped with re... |
| 3 | Component not known | Certain passenger vehicles fail to comply with... | | Dealers will inspect the fuel hose fill neck c... | certain passenger vehicles fail to comply with... |
| 4 | Component not known | Certain passenger vehicles, light duty pickup ... | | Dealers will inspect the vehicle's rear safety... | certain passenger vehicles light duty pickup t... |


In [41]:
corpora = data['cleanReviews'].to_list()

# Keywords Extraction using TF-IDF

In [42]:
#load a set of stop words
stopwords=get_stopwords_list(STOPWORD_PATH)

# Initializing TF-IDF Vectorizer with stopwords
vectorizer = TfidfVectorizer(stop_words=stopwords, smooth_idf=True, use_idf=True)

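# Building the vocabulary and IDF weights from the whole corpus
vectorizer.fit(corpora)

# Storing vocab as a plain list of Python strings
# (get_feature_names() was removed in scikit-learn 1.2;
#  on versions older than 1.0 use get_feature_names() instead)
feature_names = vectorizer.get_feature_names_out().tolist()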
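
One thing worth knowing: with `smooth_idf=True`, scikit-learn does not use the plain log(N / df) from the theory section. Its documented weight is ln((1 + n) / (1 + df)) + 1, and each document vector is then L2-normalised by default. The sketch below checks both facts on a made-up three-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data)
docs = ["brake hose leak", "brake light fails", "seat belt fails"]

vec = TfidfVectorizer(smooth_idf=True, use_idf=True)
dense = vec.fit_transform(docs).toarray()  # small enough to densify

# Document frequency of each term, in the vectorizer's column order
df = (dense > 0).sum(axis=0)
n = len(docs)

# Smoothed IDF as documented by scikit-learn: ln((1 + n) / (1 + df)) + 1
expected_idf = np.log((1 + n) / (1 + df)) + 1
print(np.allclose(vec.idf_, expected_idf))

# Rows are L2-normalised by default, so every document vector has unit length
print(np.allclose(np.linalg.norm(dense, axis=1), 1.0))
```

Because of the `+ 1`, a word that appears in every document still gets a small non-zero weight, which is one reason a stop-word list is passed to the vectorizer here.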



# Result 🔥

In [43]:
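result = []
# Walk the rows so each record holds that row's own title and correction
# (assigning data['title'] directly would store the whole column in every record)
for _, row in data.iterrows():
    record = {}
    record['issue'] = row['title']
    record['remedy'] = row['correction']
    record['full_text'] = row['cleanReviews']
    record['top_keywords'] = get_keywords(vectorizer, feature_names, row['cleanReviews'])
    result.append(record)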
    
final = pd.DataFrame(result)
final.to_csv("ChevyKeywords.csv", index=False)
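
A small heads-up if you load the CSV again later: the keyword lists are written as text, so they come back as strings. Here is a minimal sketch of one way to restore them, using an in-memory buffer and a made-up row as a stand-in for `ChevyKeywords.csv`.

```python
import ast
import io

import pandas as pd

# Stand-in for ChevyKeywords.csv: a list column round-tripped through CSV (made-up row)
frame = pd.DataFrame({"issue": ["Seat belts"], "top_keywords": [["belt", "latch"]]})
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# The list was serialised as text, so it is a plain string at this point
print(type(loaded.loc[0, "top_keywords"]))

# ast.literal_eval safely parses the string form back into a Python list
loaded["top_keywords"] = loaded["top_keywords"].apply(ast.literal_eval)
print(loaded.loc[0, "top_keywords"])
```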

**Don't forget to upvote the notebook, if you like my work. Let me know your feedback in the comment section below. 😊**

**#StaySafe**