#### tf-idf (term frequency-inverse document frequency) is a statistical measure of how important a word is to a document within a corpus. It is used in document retrieval, text mining, user modelling, and search engines.

#### The model built here is a simplified tf-idf: it scores each word by its term frequency in a document, weighted by how rare the word is across a selection of text documents.

#### This simple model can help sort unlabeled documents, or be expanded to aid document classification for automated retrieval or to feed other models in a data pipeline.
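
As a compact summary of the scoring used in the cells that follow (raw counts for term frequency, natural log for the inverse document frequency), the score of a term $t$ in a document $d$ can be written as:

$$
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \ln\frac{N}{\operatorname{df}(t)}
$$

Here $\operatorname{tf}(t, d)$ is the number of times $t$ occurs in $d$, $N$ is the number of documents, and $\operatorname{df}(t)$ is the number of documents whose text contains $t$. The symbols are introduced here for illustration; the notebook itself only uses the function names `term_frequency` and `inverse_document_frequency`.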

In [1]:
import nltk
import numpy as np


In [27]:

from pathlib import Path

# Read every file as UTF-8 and lowercase it; read_text closes the file handle for us
file_names = ["tfidf-{}.txt".format(i) for i in range(1, 11)]
datasets = {
    name: Path(name).read_text(encoding="utf8").lower()
    for name in file_names
}

In [28]:
datasets.keys()

dict_keys(['tfidf-9.txt', 'tfidf-10.txt', 'tfidf-2.txt', 'tfidf-6.txt', 'tfidf-7.txt', 'tfidf-1.txt', 'tfidf-8.txt', 'tfidf-4.txt', 'tfidf-5.txt', 'tfidf-3.txt'])

In [29]:
datasets['tfidf-6.txt']

'the french revolution was a period of far-reaching social and political upheaval in france that lasted from 1789 until 1799, and was partially carried forward by napoleon during the later expansion of the french empire. the revolution overthrew the monarchy, established a republic, experienced violent periods of political turmoil, and finally culminated in a dictatorship by napoleon that rapidly brought many of its principles to western europe and beyond. inspired by liberal and radical ideas, the revolution profoundly altered the course of modern history, triggering the global decline of absolute monarchies while replacing them with republics. through the revolutionary wars, it unleashed a wave of global conflicts that extended from the caribbean to the middle east. historians widely regard the revolution as one of the most important events in human history.\n\nthe causes of the french revolution are complex and are still debated among historians. following the seven years\' war and 

In [30]:
# Number of times each token appears in a document
def term_frequency(dataset, file_name):
    text = dataset[file_name]
    tokens = nltk.word_tokenize(text)
    fd = nltk.FreqDist(tokens)
    return fd

In [31]:
term_frequency(datasets, "tfidf-6.txt" )

FreqDist({'the': 105, 'of': 57, ',': 54, 'and': 30, '.': 28, 'in': 27, 'revolution': 18, 'to': 14, 'french': 13, 'a': 11, ...})
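
The raw counts are dominated by punctuation and function words. The `tfidf` function further down drops non-alphabetic tokens with `str.isalpha()`, which also removes hyphenated tokens such as "far-reaching". A minimal sketch of that filter in isolation, assuming a toy sentence and a crude regex tokenizer as a dependency-free stand-in for `nltk.word_tokenize` (both are illustrative, not part of the notebook):

```python
import re
from collections import Counter

# Toy text, invented for illustration
text = "the revolution, the republic, and the empire."

# Crude stand-in for nltk.word_tokenize: words, or single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
counts = Counter(tokens)

# Keep only purely alphabetic tokens, as tfidf() does via term.isalpha()
alpha_counts = Counter({t: c for t, c in counts.items() if t.isalpha()})

print(counts[","], "," in alpha_counts)
print(alpha_counts["the"])
```

Common words such as "the" survive this filter; they are pushed down the ranking later by the inverse document frequency rather than removed here.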

In [32]:
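# Inverse document frequency: log(total documents / documents containing the term)
import math
def inverse_document_frequency(dataset, term):
    # Note: `in` is a substring test on the raw text, so "world" also matches "worldwide"
    count = [term in dataset[file_name] for file_name in dataset]
    inv_df = math.log(len(count)/sum(count))
    return inv_df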
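
Because `term in dataset[file_name]` tests for a substring of the raw text, a term is counted as present whenever it occurs inside a longer word. The sketch below contrasts that behaviour with a token-based count. The names `idf_by_substring`, `idf_by_token`, and the toy `docs` are hypothetical, and whitespace splitting stands in for `nltk.word_tokenize` to keep the example self-contained:

```python
import math

def idf_by_substring(docs, term):
    # Mirrors the notebook's approach: substring test on the raw text
    matches = sum(term in text for text in docs.values())
    return math.log(len(docs) / matches)

def idf_by_token(docs, term):
    # Alternative: the term must equal a whole token
    matches = sum(term in set(text.split()) for text in docs.values())
    if matches == 0:
        raise ValueError("term does not occur in any document")
    return math.log(len(docs) / matches)

# Toy corpus, invented for illustration
docs = {
    "a.txt": "the world is round",
    "b.txt": "a worldwide network",
    "c.txt": "nothing relevant here",
}

print(idf_by_substring(docs, "world"))  # counts a.txt and b.txt
print(idf_by_token(docs, "world"))      # counts a.txt only
```

Switching the notebook to token matching would change the scores printed below, so the outputs in this notebook should be read as substring-based.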

In [33]:
inverse_document_frequency(datasets,"world")

0.5108256237659907
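
This value matches ln(10/6), which suggests that "world" was found in 6 of the 10 texts. A quick check, where the count of 6 is inferred from the printed result rather than read from the files:

```python
import math

n_documents = 10  # number of files loaded into `datasets`
n_matching = 6    # inferred from the output above, not read from the data

idf_world = math.log(n_documents / n_matching)
print(round(idf_world, 4))  # 0.5108
```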

In [34]:
def tfidf(dataset, file_name, n):
    """Return the n highest-scoring (term, tf-idf) pairs for one document."""
    term_scores = {}
    # Tokenize once, using the dataset passed in rather than the global
    file_fd = term_frequency(dataset, file_name)
    for term in file_fd:
        if term.isalpha():  # skip punctuation, numbers, and other non-alphabetic tokens
            idf_value = inverse_document_frequency(dataset, term)
            tf_value = file_fd[term]
            tfidf_value = tf_value * idf_value
            term_scores[term] = round(tfidf_value,2)
    return sorted(term_scores.items(), key=lambda x:-x[1])[:n]

#### Find the top 10 terms within a given document

In [35]:
tfidf(datasets, 'tfidf-6.txt',10)

[('french', 15.65),
 ('revolution', 12.48),
 ('rights', 6.91),
 ('privileges', 6.91),
 ('napoleon', 6.44),
 ('revolutionary', 6.02),
 ('global', 4.83),
 ('popular', 4.83),
 ('leading', 4.83),
 ('directory', 4.83)]
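
The top two scores can be reproduced by hand from the term counts printed earlier ('french': 13, 'revolution': 18). The document counts used below (3 and 5 out of 10) are inferred from the printed scores, not read from the data, so treat them as an assumption:

```python
import math

n_documents = 10
tf_french, tf_revolution = 13, 18  # counts from the FreqDist output above

# Document counts (3 and 5) are inferred from the printed scores
score_french = round(tf_french * math.log(n_documents / 3), 2)
score_revolution = round(tf_revolution * math.log(n_documents / 5), 2)

print(score_french, score_revolution)  # 15.65 12.48
```

"revolution" occurs more often than "french" in this document, but it appears to be spread across more of the corpus, so its lower inverse document frequency gives it the lower score.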

#### Profile the whole dataset and find the top 5 terms for each document

In [36]:
for file_name in datasets:
    print("{0}: \n {1}".format(file_name, tfidf(datasets,file_name,5)))

tfidf-9.txt: 
 [('rockefeller', 23.03), ('oil', 8.05), ('standard', 6.91), ('business', 6.91), ('university', 4.83)]
tfidf-10.txt: 
 [('tesla', 16.12), ('electrical', 6.44), ('wireless', 6.44), ('mechanical', 4.61), ('alternating', 4.61)]
tfidf-2.txt: 
 [('lunar', 20.72), ('module', 16.12), ('armstrong', 13.82), ('apollo', 11.51), ('spacecraft', 9.21)]
tfidf-6.txt: 
 [('french', 15.65), ('revolution', 12.48), ('rights', 6.91), ('privileges', 6.91), ('napoleon', 6.44)]
tfidf-7.txt: 
 [('leonardo', 18.42), ('painting', 6.91), ('vinci', 6.44), ('painter', 4.61), ('personality', 4.61)]
tfidf-1.txt: 
 [('soviet', 20.72), ('union', 18.42), ('axis', 16.12), ('germany', 11.27), ('allies', 11.27)]
tfidf-8.txt: 
 [('titanic', 18.42), ('passengers', 11.51), ('aboard', 9.21), ('lifeboats', 9.21), ('maritime', 9.21)]
tfidf-4.txt: 
 [('washington', 25.33), ('army', 4.83), ('president', 4.82), ('address', 4.61), ('colonial', 4.61)]
tfidf-5.txt: 
 [('newton', 23.03), ('mathematical', 6.91), ('scientis