# TF-IDF Algorithm for keyword extraction

Run this notebook from the directory that contains the Corpora class.

In [4]:
import nltk
from models.Corpora import Corpora
from nltk.stem import WordNetLemmatizer
import math
from nltk import bigrams, trigrams

# Defining the TF-IDF algorithm

##### Definition copied from another notebook I found online ;)

TF-IDF, short for Term Frequency-Inverse Document Frequency, is a way to convert textual data into numeric form. The value it yields for each term is the product of two factors: TF and IDF.

Let's first look at term frequency. A count vectorizer uses raw term counts; TF-IDF needs one more step to turn those counts into relative frequencies. Suppose we have the following two documents in total.

1. I love dogs
2. I hate dogs and knitting

Relative term frequency is calculated for each term within each document as follows.

$${TF(t,d)} = \frac {number\ of\ times\ term(t)\ appears\ in\ document(d)}{total\ number\ of\ terms\ in\ document(d)}$$

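Inverse document frequency measures how rare a term is across the whole set of documents $D$: it is the logarithm of the total number of documents divided by the number of documents that contain the term.

$${IDF(t,D)} = \log \Big(\frac {total\ number\ of\ documents\ in\ D}{number\ of\ documents\ containing\ term(t)}\Big)$$

If we calculate inverse document frequency for 'I', which appears in both documents,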

$${IDF('I',D)} = \log \Big(\frac {2}{2}\Big) = {0}$$
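
For contrast, a term that occurs in only one of the two documents gets a positive IDF. A worked example for 'love', assuming the natural logarithm (the formula above does not fix a base; the natural log is what `math.log` in the implementation further down uses):

$${IDF('love',D)} = \log \Big(\frac {2}{1}\Big) \approx {0.693}$$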

Once we have the values for TF and IDF, we can calculate TF-IDF as follows.

$${TFIDF(t,d,D)} = {TF(t,d)}\cdot{IDF(t,D)}$$

In our example, the TF-IDF score of the term 'I' in each of the two documents is:

$${TFIDF('I',d1,D)} = {TF('I',d1)}\cdot{IDF('I',D)} = {0.33}\times{0} = {0}$$

$${TFIDF('I',d2,D)} = {TF('I',d2)}\cdot{IDF('I',D)} = {0.2}\times{0} = {0}$$

As you can see, the term 'I' appears in both documents, so its IDF, and therefore its TF-IDF score, is 0: a term that occurs in every document does not help to tell documents apart. The rest works like a count vectorizer: a TF-IDF vectorizer calculates these scores for every term in every document and thereby converts textual data into numeric form.
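
The calculation above can be reproduced in a few lines of plain Python. This is only a minimal sketch of the formulas, not the implementation used in this notebook: the helper names `tf`, `idf` and `tfidf` are made up for illustration, the documents are lower-cased and split on whitespace, and the natural logarithm is assumed.

```python
import math

# The two example documents, lower-cased and split on whitespace (an assumption;
# the text above does not specify a tokenizer).
documents = {
    "d1": "i love dogs".split(),
    "d2": "i hate dogs and knitting".split(),
}

def tf(term, tokens):
    """Relative frequency of `term` within one tokenized document."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`).

    Only valid for terms that occur in at least one document.
    """
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing)

def tfidf(term, doc_id, docs):
    """TF-IDF score of `term` in the document identified by `doc_id`."""
    return tf(term, docs[doc_id]) * idf(term, docs)

for term in ("i", "dogs", "love", "knitting"):
    for doc_id, tokens in documents.items():
        if term in tokens:
            print(f"{term!r} in {doc_id}: {tfidf(term, doc_id, documents):.3f}")
```

Terms shared by both documents ('i', 'dogs') score 0, while 'love' scores about 0.231 in the first document and 'knitting' about 0.139 in the second.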

In [5]:
def tf_idf(algorithms: list, corpora: Corpora) -> dict:
    """
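    Use the tf-idf algorithm to determine the most relevant terms (unigrams,
    bigrams and trigrams) in the corpus of each algorithm.
    :param algorithms: names of the algorithms whose corpora are compared
    :param corpora: Corpora object holding the tokenized text per algorithm
    :return: dict mapping each algorithm name to a list of (token, tf-idf)
             tuples, sorted by descending score
    """
    # Helper functions for the tf*idf algorithm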
    def frequency(word: str, document: list) -> int:
        return document.count(word)

    def number_of_words(doc: list) -> int:
        return len(doc)

    def term_frequency(word: str, document: list) -> float:
        return float(frequency(word, document) / number_of_words(document))

    def number_of_docs_containing_word(word: str, documents: list) -> int:
        count: int = 0
        for document in documents:
            if frequency(word, document) > 0:
                count += 1
        return count

    def inverse_document_freq(word: str, documents: list) -> float:
        # log(Total number of documents / number of docs with the term)
        return math.log(len(documents) / number_of_docs_containing_word(word, documents))

    # Requires the NLTK data packages 'stopwords' and 'wordnet' (see nltk.download)
    stopwords = set(nltk.corpus.stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    result_dict: dict = {}
    vocabulary: list = []
    # For function in class Topic_Engine
    # documents: list = [self.corpora.token_corpora[i.lower()] for i in algorithms]  # Already tokenized with RegExp
    documents = [corpora.token_corpora[i.lower()] for i in algorithms]
    for i, tokens in enumerate(documents):
        doc_id = "{}".format(algorithms[i].lower())
        
        # Stopwords are filtered twice, which is ugly but necessary:
        # some tokens are only lemmatized to a stopword such as "the"
        cleaned_tokens: list = [lemmatizer.lemmatize(token.lower()) for token in tokens if token.lower() not in stopwords]
        cleaned_tokens = [t for t in cleaned_tokens if t not in stopwords]

        # bigrams()/trigrams() yield tuples of tokens; join each tuple into one string
        bigram_tokens: list = [' '.join(token) for token in bigrams(cleaned_tokens)]
        trigram_tokens: list = [' '.join(token) for token in trigrams(cleaned_tokens)]

        all_tokens: list = []
        all_tokens.extend(cleaned_tokens)
        all_tokens.extend(bigram_tokens)
        all_tokens.extend(trigram_tokens)

        vocabulary.append(all_tokens)

        result_dict.update({doc_id: {}})
        for token in dict.fromkeys(all_tokens):  # each distinct token once, in first-seen order
            result_dict[doc_id].update({token: {}})
            term_freq: float = term_frequency(token, all_tokens)
            result_dict[doc_id][token].update({'term_frequency': term_freq})

    for doc in result_dict:
        for token in result_dict[doc]:
            # Calculating IDF
            result_dict[doc][token].update({"inverse_document_frequency": inverse_document_freq(token, vocabulary)})
            # Calculating TF-IDF
            result_dict[doc][token].update(
                {"tf-idf": result_dict[doc][token]["term_frequency"] * result_dict[doc][token]["inverse_document_frequency"]})
    
    # TODO Can be included in upper for loop for less code and little bit faster execution
    # Build new dict with only "token -> tf-idf"
    words = {}
    for doc in result_dict:
        words[doc] = {token: values['tf-idf'] for token, values in result_dict[doc].items()}
    
    # Print the 15 highest-scoring tokens per algorithm
    for doc in words:
        words[doc] = sorted(words[doc].items(), key=lambda entry: entry[1], reverse=True)
        print("\n\n###### Results for algorithm: "+doc+" ######")
        for token_and_score in words[doc][:15]:
            print(token_and_score)
    return words
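
To make the pooled token list easier to picture, here is a small sketch of how unigrams, bigrams and trigrams end up in one list. It uses plain slicing instead of NLTK so that it runs without extra data; the helper `ngrams_joined` and the sample tokens are hypothetical and not part of the function above.

```python
def ngrams_joined(tokens, n):
    """Return all contiguous n-token windows, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already cleaned tokens of one document
tokens = ["cluster", "analysis", "group", "data"]

# Same pooling as in tf_idf(): unigrams first, then bigrams, then trigrams
all_tokens = tokens + ngrams_joined(tokens, 2) + ngrams_joined(tokens, 3)
print(all_tokens)
```

Because all three n-gram sizes share one list, the term frequency of any token is divided by the combined number of unigrams, bigrams and trigrams (here 4 + 3 + 2 = 9), not by the number of words alone.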

# Testing the algorithm

In [6]:
from models import Corpora
from models import Topic_Engine

corp = Corpora(["Clustering"], ["01_data/01_Clustering_definitions"])
corp.build_all_corpora_for_new_algorithm_type("Classification", "01_data/02_Classification_definitions")

print("Base text for clustering:\n")
print(corp.raw_corpora["clustering"])
print("\n\n")
print("Using tf-idf:")
tf_idf_result = tf_idf(algorithms=["clustering", "Classification"], corpora=corp)

Base text for clustering:

 Clustering algorithms examine data to find groups of items that are similar. For example, an insurance company might group customers according to income, age, types of policy purchased or prior claims experience. In a fault diagnosis application, electrical faults might be grouped according to the values of certain key variables.
 Clustering is the process of making a group of abstract objects into classes of similar objects. Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering. Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them acc