Term Frequency – How often a term occurs in a text. It is measured as (number of times term t appears in the document) / (total number of words in the document).

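Inverse Document Frequency – How rare, and therefore how informative, a term is across the text. Because there is only one document here, each sentence is treated as a document: IDF is measured as log(total number of sentences / number of sentences containing term t).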
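The two definitions above can be written as a pair of small functions. This is a minimal sketch: the function names and the sample numbers (a term seen 4 times in a 143-word text, present in 3 of 6 sentences) are illustrative assumptions, not part of the notebook's code.

```python
import math

def term_frequency(term_count, total_words):
    """Occurrences of the term divided by the total number of words."""
    return term_count / total_words

def inverse_document_frequency(total_sentences, sentences_with_term):
    """Natural log of (total sentences / sentences containing the term)."""
    return math.log(total_sentences / sentences_with_term)

# Illustrative numbers (assumed): a term seen 4 times in a 143-word text,
# occurring in 3 of its 6 sentences.
tf = term_frequency(4, 143)
idf = inverse_document_frequency(6, 3)
tf_idf = tf * idf
print(tf, idf, tf_idf)
```

A term that occurs in every sentence gets an IDF of log(1) = 0, so its TF-IDF is 0 no matter how often it appears.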

In [2]:
from nltk import tokenize
from operator import itemgetter
import math

In [3]:
doc = 'Specific visions for Web3 differ, and the term has been described by Bloomberg as hazy, but they revolve around the idea of decentralization and often incorporate blockchain technologies, such as various cryptocurrencies and non-fungible-tokens. Bloomberg has described Web3 as an idea that would build financial assets, in the form of tokens, into the inner workings of almost anything you do online. Some visions are based around the concept of decentralized-autonomous-organizations. Decentralized-finance is another key concept; in it, users exchange currency without bank or government involvement. Self-sovereign identity allows users to identify themselves without relying on an authentication system such as OAuth, in which a trusted party has to be reached in order to assess identity. Technology scholars have argued that Web3 would likely run in tandem with Web2 sites, with Web2 sites likely adopting Web3 technologies in order to keep their services relevant.'

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

In [7]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
stop_words = set(stopwords.words('english'))

Count total number of words, including repetitions.

In [None]:
total_words = doc.split()
total_word_length = len(total_words)
print(total_word_length)
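Splitting on whitespace keeps punctuation attached to words (for example 'differ,' and 'online.') and keeps capitalised stop words such as 'Some' distinct from their lowercase forms. One possible alternative, sketched here under the assumption that lowercased tokens are acceptable, is a regular expression that keeps letters, digits, and internal hyphens. The helper name `normalized_words` and the shortened sample sentence are hypothetical.

```python
import re

def normalized_words(text):
    """Lowercase the text and keep runs of letters/digits joined by hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

# Shortened sample sentence (assumed for illustration).
sample = "Specific visions for Web3 differ, and the term has been described as hazy."
print(sample.split()[:5])            # whitespace split keeps 'differ,' with its comma
print(normalized_words(sample)[:5])  # punctuation stripped, case folded
```

Changing the tokenization changes the word count, so every score computed from it would change as well.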

Count total number of sentences.

In [None]:
total_sentences = tokenize.sent_tokenize(doc)
total_sent_len = len(total_sentences)
print(total_sent_len)

Create a dictionary that maps each non-stop word to its occurrence count.

In [15]:
tf_score = {}
for each_word in total_words:

    if each_word not in stop_words:
        if each_word in tf_score:
            tf_score[each_word] += 1
        else:
            tf_score[each_word] = 1
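The same tally can be expressed with `collections.Counter` from the standard library. This is a small sketch of that alternative; `words` and `stop_words_demo` are hypothetical stand-ins for the notebook's `total_words` and `stop_words`.

```python
from collections import Counter

# Hypothetical stand-ins for the notebook's total_words and stop_words.
words = ["Web3", "is", "an", "idea", "and", "Web3", "is", "hazy"]
stop_words_demo = {"is", "an", "and"}

# Counter tallies every non-stop word in a single pass.
counts = Counter(w for w in words if w not in stop_words_demo)
print(counts)
```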

Divide each count by total_word_length to get the term frequency.

In [16]:
tf_score.update((x, y/int(total_word_length)) for x, y in tf_score.items())
print(tf_score)

{'Specific': 0.006993006993006993, 'visions': 0.013986013986013986, 'Web3': 0.027972027972027972, 'differ,': 0.006993006993006993, 'term': 0.006993006993006993, 'described': 0.013986013986013986, 'Bloomberg': 0.013986013986013986, 'hazy,': 0.006993006993006993, 'revolve': 0.006993006993006993, 'around': 0.013986013986013986, 'idea': 0.013986013986013986, 'decentralization': 0.006993006993006993, 'often': 0.006993006993006993, 'incorporate': 0.006993006993006993, 'blockchain': 0.006993006993006993, 'technologies,': 0.006993006993006993, 'various': 0.006993006993006993, 'cryptocurrencies': 0.006993006993006993, 'non-fungible-tokens.': 0.006993006993006993, 'would': 0.013986013986013986, 'build': 0.006993006993006993, 'financial': 0.006993006993006993, 'assets,': 0.006993006993006993, 'form': 0.006993006993006993, 'tokens,': 0.006993006993006993, 'inner': 0.006993006993006993, 'workings': 0.006993006993006993, 'almost': 0.006993006993006993, 'anything': 0.006993006993006993, 'online.': 0.

Function that checks which sentences contain a word and returns the number of matching sentences.

In [18]:
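def WordSent(word, sentences):
    # `for w in word` walks the characters of the string, so a sentence matches
    # when it contains every character of `word`, not necessarily the word itself.
    final = [all([w in x for w in word]) for x in sentences]
    # Keep the matching sentences and return how many there are.
    sent_len = [sentences[i] for i in range(0, len(final)) if final[i]]
    return int(len(sent_len))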
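Because `for w in word` iterates over the characters of a string, `WordSent` counts a sentence whenever all of the word's characters appear somewhere in it. A stricter variant, sketched below, counts only sentences whose whitespace-separated tokens include the word, which matches the way `total_words` is built with `doc.split()`. The function name `sentences_containing` and the demo sentences are assumptions made for illustration.

```python
def sentences_containing(word, sentences):
    """Count the sentences whose whitespace-separated tokens include `word`."""
    return sum(1 for sentence in sentences if word in sentence.split())

# Hypothetical sentences, used only to exercise the function.
demo_sentences = [
    "Web3 builds on blockchain technologies.",
    "Some visions are based around decentralization.",
    "Web2 sites may adopt Web3 technologies.",
]
print(sentences_containing("Web3", demo_sentences))  # token present in two sentences
print(sentences_containing("site", demo_sentences))  # 'sites' is a different token
```

With this variant, a short word no longer matches a sentence just because its letters are scattered across other words.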

In [19]:
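idf_score = {}
for each_word in total_words:

    if each_word not in stop_words:
        if each_word in idf_score:
            # Seen before: replace the placeholder with the sentence count from WordSent.
            idf_score[each_word] = WordSent(each_word, total_sentences)
        else:
            # First occurrence: start at 1; a word that occurs only once keeps this value.
            idf_score[each_word] = 1

# Take log(total number of sentences / sentence count) for each word
idf_score.update((x, math.log(int(total_sent_len)/y)) for x, y in idf_score.items())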
print(idf_score)
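In the loop above, a word's sentence count is only looked up from its second occurrence onward; a word that appears once keeps the placeholder value 1. One way to compute the sentence count for every distinct term in a single pass is to count, per sentence, the set of tokens it contains. The sketch below assumes whitespace tokenization; the name `idf_scores` and the demo data are hypothetical.

```python
import math
from collections import Counter

def idf_scores(sentences, stop_words=frozenset()):
    """Return {term: log(N / df)}, where df is the number of sentences containing the term."""
    doc_freq = Counter()
    for sentence in sentences:
        # set() so a term repeated within one sentence is counted once
        doc_freq.update(set(sentence.split()) - set(stop_words))
    total = len(sentences)
    return {term: math.log(total / df) for term, df in doc_freq.items()}

# Hypothetical sentences and stop words for illustration.
demo = ["Web3 is hazy", "Web3 uses tokens", "tokens are assets", "Web2 sites and Web2 apps"]
scores = idf_scores(demo, stop_words={"is", "are", "and"})
print(scores["Web3"], scores["Web2"])
```

Here 'Web3' occurs in 2 of 4 sentences and 'Web2' in 1 of 4, so 'Web2' receives the higher IDF even though it appears twice in its sentence.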

{'Specific': 1.791759469228055, 'visions': 0.1823215567939546, 'Web3': 0.6931471805599453, 'differ,': 1.791759469228055, 'term': 1.791759469228055, 'described': 0.0, 'Bloomberg': 1.0986122886681098, 'hazy,': 1.791759469228055, 'revolve': 1.791759469228055, 'around': 0.0, 'idea': 0.0, 'decentralization': 1.791759469228055, 'often': 1.791759469228055, 'incorporate': 1.791759469228055, 'blockchain': 1.791759469228055, 'technologies,': 1.791759469228055, 'various': 1.791759469228055, 'cryptocurrencies': 1.791759469228055, 'non-fungible-tokens.': 1.791759469228055, 'would': 0.4054651081081644, 'build': 1.791759469228055, 'financial': 1.791759469228055, 'assets,': 1.791759469228055, 'form': 1.791759469228055, 'tokens,': 1.791759469228055, 'inner': 1.791759469228055, 'workings': 1.791759469228055, 'almost': 1.791759469228055, 'anything': 1.791759469228055, 'online.': 1.791759469228055, 'Some': 1.791759469228055, 'based': 1.791759469228055, 'concept': 1.791759469228055, 'decentralized-autonomo

Multiply the TF and IDF scores of each word to obtain its TF-IDF score.

In [20]:
tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}
print(tf_idf_score)

{'Specific': 0.012529786498098286, 'visions': 0.0025499518432720923, 'Web3': 0.019388732323355112, 'differ,': 0.012529786498098286, 'term': 0.012529786498098286, 'described': 0.0, 'Bloomberg': 0.015365206834519017, 'hazy,': 0.012529786498098286, 'revolve': 0.012529786498098286, 'around': 0.0, 'idea': 0.0, 'decentralization': 0.012529786498098286, 'often': 0.012529786498098286, 'incorporate': 0.012529786498098286, 'blockchain': 0.012529786498098286, 'technologies,': 0.012529786498098286, 'various': 0.012529786498098286, 'cryptocurrencies': 0.012529786498098286, 'non-fungible-tokens.': 0.012529786498098286, 'would': 0.0056708406728414595, 'build': 0.012529786498098286, 'financial': 0.012529786498098286, 'assets,': 0.012529786498098286, 'form': 0.012529786498098286, 'tokens,': 0.012529786498098286, 'inner': 0.012529786498098286, 'workings': 0.012529786498098286, 'almost': 0.012529786498098286, 'anything': 0.012529786498098286, 'online.': 0.012529786498098286, 'Some': 0.012529786498098286,

In [21]:
def TopN(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key = itemgetter(1), reverse = True)[:n]) 
    return result
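When only a few top entries are needed from a large dictionary, `heapq.nlargest` avoids sorting every item. The sketch below is an optional alternative to `TopN`; the name `top_n` is hypothetical and the demo scores are rounded, illustrative values.

```python
import heapq
from operator import itemgetter

def top_n(scores, n):
    """Return the n highest-scoring (term, score) pairs as a dict, best first."""
    return dict(heapq.nlargest(n, scores.items(), key=itemgetter(1)))

# Rounded, illustrative scores (assumed).
demo_scores = {"Web2": 0.025, "Web3": 0.019, "Bloomberg": 0.015, "idea": 0.0}
print(top_n(demo_scores, 2))
```

For a dictionary of this size, the difference from a full sort is negligible; the choice is mainly one of style.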

In [23]:
print(TopN(tf_idf_score, 5))

{'Web2': 0.025059572996196572, 'Web3': 0.019388732323355112, 'Bloomberg': 0.015365206834519017, 'Specific': 0.012529786498098286, 'differ,': 0.012529786498098286}
