# Term Frequency-Inverse Document Frequency

This feature will compare the frequencies of n-grams used in the comment to those used in the article. We could also augment this feature with WordNet to compare the frequencies of related words in both documents.

A term will be defined as an n-gram in the comment. Term frequency will be the frequency of a given term in the document (the article) that the comment is referring to. Inverse document frequency will be the log of the number of articles in the corpus divided by the number of those articles that contain the term, so a term that shows up in many articles gets a low weight. Running this against all of the articles that we have scraped could prove too slow to finish in a reasonable amount of time. If this is the case, we can compute it against a small random sample of other articles instead (10, for example).
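If the full corpus does turn out to be too slow, the random-sample fallback could look something like the sketch below. `sample_corpus` is a hypothetical helper, not something already in the project; the default of 10 articles, the `seed` parameter, and the choice to keep the target article in the sampled corpus are all assumptions.

```python
import random


def sample_corpus(target_index, all_articles, k=10, seed=None):
    """Return the target article plus up to k other articles chosen at random.

    Hypothetical helper: the name, the default k=10, and keeping the target
    article in the returned corpus are assumptions, not the project's method.
    """
    rng = random.Random(seed)
    # Every article except the one the comment refers to.
    others = all_articles[:target_index] + all_articles[target_index + 1:]
    # Never ask for more articles than are available.
    k = min(k, len(others))
    return [all_articles[target_index]] + rng.sample(others, k)


# Stand-in article texts, just to show the shape of the result.
articles = [f"article text {i}" for i in range(50)]
subset = sample_corpus(0, articles, k=10, seed=0)
print(len(subset))  # 11: the target article plus 10 sampled articles
```

A smaller corpus makes idf noisier, so scores from different samples are probably only roughly comparable.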

Going to first code a quick example of how TF-IDF should work. The documents used and the results should match the example on the Wikipedia page for tf-idf, which can be found here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [1]:
from collections import Counter
import math

In [2]:
doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

In [3]:
def tf(term, document):
    # Fraction of the document's whitespace-separated tokens that equal `term`.
    term_arr = document.split()
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)

In [4]:
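def idf(term, corpus):
    # log10 of (number of documents / number of documents containing `term`).
    num_docs = len(corpus)
    count = 0
    for doc in corpus:
        if term in doc.split():
            count += 1
    # A term found in no document scores 0; this also avoids dividing by zero.
    if count == 0:
        return 0
    return math.log10(num_docs / count)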


In [5]:
def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

In [12]:
tfidf('sample', doc1, corpus)

0.06020599913279624
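As a sanity check, the same number can be worked out by hand. Here $d_1$ stands for `doc1` and $D$ for the two-document `corpus` (symbols introduced only for this check). "sample" is 1 of the 5 tokens in `doc1` and appears in 1 of the 2 documents:

$$
\mathrm{tf}(\text{sample}, d_1) = \frac{1}{5} = 0.2, \qquad
\mathrm{idf}(\text{sample}, D) = \log_{10}\frac{2}{1} \approx 0.301, \qquad
\mathrm{tfidf}(\text{sample}, d_1, D) \approx 0.2 \times 0.301 \approx 0.0602
$$

By the same arithmetic, "this" scores 0 in either document because it appears in both ($\log_{10}\frac{2}{2} = 0$), and "example" in `doc2` scores $\frac{3}{7} \times 0.301 \approx 0.129$.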

I think I can use this exact same code on actual Reddit data. The terms will be the words in a comment: loop through every word, grab the tfidf score for each one, and sum these scores together to get the comment's tfidf score. The document is the pure article text (need to look at what Sam did for word score comparisons to grab article text), and the corpus is just a list of the document variables (which, again, are just pure text).
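A minimal sketch of that per-comment loop, under some assumptions: `comment_tfidf` and `document_frequencies` are hypothetical names, tokenization is a plain whitespace split with no punctuation handling, and repeated words in a comment are counted every time they occur. It also counts document frequencies once up front rather than rescanning the corpus for every word, which is a swapped-in optimization and not what the cells above do.

```python
from collections import Counter
import math


def document_frequencies(corpus):
    """Count, for each term, how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        # set() so a term is counted at most once per document.
        df.update(set(doc.split()))
    return df


def comment_tfidf(comment, document, corpus, df=None):
    """Sum the tf-idf scores of every word in the comment (hypothetical helper)."""
    if df is None:
        df = document_frequencies(corpus)
    doc_terms = document.split()
    if not doc_terms:
        return 0.0
    term_counts = Counter(doc_terms)
    score = 0.0
    for word in comment.split():
        # Words found in no document contribute nothing (and would divide by zero).
        if df[word] == 0:
            continue
        tf = term_counts[word] / len(doc_terms)
        idf = math.log10(len(corpus) / df[word])
        score += tf * idf
    return score


# Same toy documents as the Wikipedia example.
doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]
print(comment_tfidf('a sample', doc1, corpus))  # roughly 0.18
```

When scoring many comments against the same corpus, `document_frequencies(corpus)` can be computed once and passed in as `df`. Summing also means longer comments tend to score higher, so dividing by the comment's word count might be worth trying.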

In [17]:
import pandas as pd
topics_data = pd.read_csv('files/topics_data_119.csv')


In [23]:
topics_data.iloc[1]['text']

'jake gottlieb, the founder of visium. reuters/ rick wilking visium asset management, a multibillion-dollar hedge fund, has imploded in the biggest scandal to hit the industry in years.\n\nthe fund told investors of its plan to close in a letter friday. it\'s the most high-profile shutdown since authorities forced steve cohen\'s controversial sac capital to close in 2013.\n\n\n\na slew of factors — from a brewing insider-trading scandal to a contentious investment by visium\'s founder that rankled investors and staffers alike — led to visium\'s demise.\n\n\n\non top of all of that, visium\'s flagship fund had been reporting dismal performance. visium is selling one of its better performing funds to alliancebernstein, it said.\n\njust days before visium announced the shut down, one of its top portfolio managers — sanjay valvani — was charged with wire and securities fraud. he is accused of using insider information from a food and drug administration official to place trades on drug com