# Term Frequency-Inverse Document Frequency

This feature compares the frequencies of the n-grams used in a comment to those used in the article. We could also augment the feature with WordNet to compare the frequencies of related words in both documents.
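
As a rough illustration of the n-gram comparison, here is a minimal sketch that counts word n-grams in a comment and in an article and reports the ones they share. The helper names (`ngrams`, `ngram_counts`) and the sample strings are made up for this example, and tokenisation is a plain lowercase whitespace split, which is an assumption rather than the project's actual preprocessing.

```python
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams in `text` as a list of tuples (lowercased)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_counts(text, n):
    """Count how often each n-gram occurs in `text`."""
    return Counter(ngrams(text, n))

# Hypothetical comment and article text, for illustration only
comment_counts = ngram_counts('the senate vote', 2)
article_counts = ngram_counts(
    'the senate vote was delayed after the senate vote failed', 2)

# For each bigram in the comment that also occurs in the article:
# (count in comment, count in article)
shared = {gram: (comment_counts[gram], article_counts[gram])
          for gram in comment_counts if gram in article_counts}
```

Using tuples as keys keeps unigrams, bigrams, and longer n-grams in the same `Counter` interface, so the same comparison works for any `n`.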

A term is an n-gram that appears in the comment. Term frequency is the frequency of that term in the article the comment refers to. Inverse document frequency is derived from how many of the scraped articles contain the term: the fewer articles contain it, the higher the value. Computing this across every scraped article could prove too slow to be practical. If so, we can compare against a smaller corpus instead, such as 10 randomly chosen articles.
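
For reference, these are the standard definitions as I understand them from the Wikipedia page, written with the base-10 logarithm that the example code in this notebook uses. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $D$ is the corpus, and $N$ is the number of documents in $D$ (the symbol names are my own choice):

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log_{10} \frac{N}{\left|\{d \in D : t \in d\}\right|}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
$$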

First, a quick example of how TF-IDF should work. The documents used and the results should match the example on the Wikipedia page for tf-idf, which can be found here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [1]:
from collections import Counter
import math

In [2]:
doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

In [3]:
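def tf(term, document):
    # Fraction of the words in `document` that are `term`
    term_arr = document.split()
    if not term_arr:
        # Empty document: avoid dividing by zero
        return 0
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)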

In [4]:
def idf(term, corpus):
    # log10 of (number of documents / number of documents containing `term`)
    count = sum(1 for doc in corpus if term in doc.split())
    if count == 0:
        # Term appears nowhere in the corpus: return 0 rather than divide by zero
        return 0
    return math.log10(len(corpus) / count)


In [5]:
def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

In [6]:
tfidf('sample', doc1, corpus)

0.06020599913279624
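
As a further sanity check against the Wikipedia example, the sketch below restates the same three functions in compact form and scores two more terms: `'this'`, which appears in both documents, and `'example'`, which appears three times in the second document only. The variable names `this_score` and `example_score` are just for this check.

```python
from collections import Counter
import math

doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

def tf(term, document):
    words = document.split()
    return Counter(words)[term] / len(words)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log10(len(corpus) / containing) if containing else 0

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# 'this' occurs in both documents, so idf = log10(2/2) = 0 and the score is 0
this_score = tfidf('this', doc1, corpus)

# 'example' occurs 3 times among the 7 words of doc2 and in 1 of 2 documents,
# so the score is (3/7) * log10(2), roughly 0.129
example_score = tfidf('example', doc2, corpus)
```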

I think this exact code can be reused on the actual Reddit data. Each word in a comment is a term: loop through the words, compute the TF-IDF score of each, and sum the scores to get the comment's TF-IDF score. The document is the plain article text (need to look at what Sam did for the word score comparisons to see how to grab article text), and the corpus is a list of those documents (again, plain text).
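
The per-comment score described above can be sketched as follows. This is a minimal version under stated assumptions: `comment_tfidf` is a name introduced here, the article strings are invented, and words are split on whitespace with no lowercasing or punctuation handling.

```python
from collections import Counter
import math

def tf(term, document):
    words = document.split()
    return Counter(words)[term] / len(words) if words else 0

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log10(len(corpus) / containing) if containing else 0

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

def comment_tfidf(comment, document, corpus):
    """Sum the tf-idf score of every word in `comment` against `document`."""
    return sum(tfidf(word, document, corpus) for word in comment.split())

# Hypothetical article texts, for illustration only
article = 'the senate will vote on the bill'
other_article = 'the team won the game'
corpus = [article, other_article]

# 'senate' and 'vote' each contribute (1/7) * log10(2);
# 'today' is not in the article, so it contributes 0
score = comment_tfidf('senate vote today', article, corpus)
```

Note that a word repeated in the comment is counted once per occurrence, so longer comments tend to score higher. Whether to deduplicate words or normalise by comment length is an open design choice.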

In [7]:
import pandas as pd
topics_data = pd.read_csv('files/compiled_topics.csv')
comments_data = pd.read_csv('files/compiled_comments.csv')

In [8]:
topics_data

Unnamed: 0,title,score,id,url,comms_num,created,body,text
0,[META] Welcome to NeutralNews,225,4o2o29,https://www.reddit.com/r/neutralnews/comments/...,66,1.465955e+09,The goal of /r/NeutralNews is to provide a pla...,the goal of r/neutralnews is to provide a plac...
1,Wall Street has been rocked by an $8 billion h...,183,4op948,http://www.businessinsider.com/visium-asset-ma...,10,1.466297e+09,,"jake gottlieb, the founder of visium. reuters/..."
2,How a $2 Roadside Drug Test Sends Innocent Peo...,91,4sef35,http://www.nytimes.com/2016/07/10/magazine/how...,11,1.468315e+09,,field tests provide quick answers. but if thos...
3,Sessions spoke twice with Russian ambassador d...,668,5x0k84,https://www.washingtonpost.com/world/national-...,84,1.488450e+09,,error
4,The Russia story just keeps getting worse for ...,90,64zsim,http://www.cnn.com/2017/04/12/politics/trump-c...,31,1.492048e+09,,washington (cnn) two stories dealing with russ...
...,...,...,...,...,...,...,...,...
1429,How does Athenian direct democracy compare to ...,15,bbyvcq,https://www.reddit.com/r/NeutralPolitics/comme...,13,1.555012e+09,Athenian democracy is an interesting form of g...,Athenian democracy is an interesting form of g...
1430,What are the pros and cons of the turnover at ...,497,bc04va,https://www.reddit.com/r/NeutralPolitics/comme...,171,1.555020e+09,[Trump is still undergoing historic turnover i...,Trump is still undergoing historic turnover in...
1431,Indexing Andrew Yang's Freedom Dividend,403,bb4ub9,https://www.reddit.com/r/NeutralPolitics/comme...,316,1.554822e+09,"Andrew Yang's Universal Basic Income or ""Freed...","Andrew Yang's Universal Basic Income or ""Freed..."
1432,Is the designation of the Iranian Islamic Revo...,479,bav0rl,https://www.reddit.com/r/NeutralPolitics/comme...,100,1.554765e+09,The Trump Administration has designated Iran's...,The Trump Administration has designated Iran's...


In [16]:
comments_data[comments_data['action'] == 'removecomment']

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
1785,removecomment,"How is this""neutral news""? Literally, it's rep...",b756df5867ce042f3a07f3037e5eeeb9,Rule 5: top-level comment has no links,cgd7ut,5dea4ee94af43200093a2f4c,0.000000,0.912878
1786,removecomment,Just wondering if you have any updates.,58843f7430c71f72208766a295eaae5e,Low effort top-level comment,cim6kf,5dea4ee94af43200093a2f4d,0.000000,0.908528
1787,removecomment,.,f9dcef98ec140a8f44d5691d34081408,Rule 5: top-level comment has no links,cfzkky,5dea4ee94af43200093a2f4e,0.000000,0.867635
1788,removecomment,I went to park,aab8641b0dc26f59e9c8fea95f470138,Low effort top-level comment,asrdao,5dea4ee94af43200093a2f4f,0.000000,0.881256
1789,removecomment,"Hi, u/SFepicure\n\n[I think you may enjoy this...",e48fe0135108ee9985e260caf57cc6e0,remove,b47a1d,5dea4ee94af43200093a2f50,0.000000,0.849436
...,...,...,...,...,...,...,...,...
7222,removecomment,"There is no such thing as a ""divided governmen...",d30abee7e00bc527d05dae3ac65c0a1e,remove,j0ijwn,5f70e83cdbe1ef0009d976d9,0.459704,0.805176
7223,removecomment,Reddit will be happy about this one for sure.,7fc8a150ecdb1a39d007c28192daafbc,remove,j0ijwn,5f70e83cdbe1ef0009d976db,0.000000,0.809528
7226,removecomment,Oh I thought we were posting arbitrary facts t...,38e2216dc3dc2cb196cbf4bab0d83541,remove,j0ijwn,5f70e968dbe1ef0009d976e7,0.000000,0.838691
7227,removecomment,No. The senate is not wasting time to fill th...,c4f22a6532a4ca274c5027e4bc3d3d99,remove,j0ijwn,5f70e968dbe1ef0009d976e9,0.459704,0.840920


In [17]:
comments_data[comments_data['action'] == 'approvecomment']

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
1810,approvecomment,Ahh I didn't know you knew the Bidens on a per...,2aca7285cb6f39201d3ea3e2888dc9b6,confirm_ham,cgkueu,5df947bb4af43200093c6c61,0.000000,0.792988
1811,approvecomment,These guys are career politicians. They are sm...,2aca7285cb6f39201d3ea3e2888dc9b6,confirm_ham,cgkueu,5df947bb4af43200093c6c62,0.442466,0.764901
1812,approvecomment,You caught me comrade. Russians are everywhere...,2aca7285cb6f39201d3ea3e2888dc9b6,confirm_ham,cgkueu,5df947bb4af43200093c6c63,0.251159,0.790013
1827,approvecomment,Thats because you haven't read a single articl...,1c6cfebeda2d8b70ba383ac8d1face94,confirm_ham,cgl6x6,5e15e5384af43200093cfded,0.273914,0.861274
1834,approvecomment,I think the subreddit has been greatly missed ...,7d255c78291615c6d715b194564574ad,unspam,fd2f6z,5e5ee75c4af43200093ec080,0.000000,0.857346
...,...,...,...,...,...,...,...,...
7221,approvecomment,I stand by the opinion that since we didn't ge...,b6026f2de045d703263813969c00c5c6,confirm_ham,j0ijwn,5f70e83cdbe1ef0009d976d8,0.170503,0.837942
7224,approvecomment,Judiciary hearing starts oct 12th. Fill the s...,c4f22a6532a4ca274c5027e4bc3d3d99,confirm_ham,j0ijwn,5f70e968dbe1ef0009d976e3,0.154348,0.846092
7225,approvecomment,It will last 3 to 4 days. This is a sham!\n\nh...,38e2216dc3dc2cb196cbf4bab0d83541,confirm_ham,j0ijwn,5f70e968dbe1ef0009d976e5,0.159531,0.817497
7228,approvecomment,"My point was not addressed, regarding providin...",062512f6041e1c40256c31f31a1138fc,unspam,j04eqv,5f70ee18dbe1ef0009d97717,1.281814,0.863294


In [18]:
comments_data

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
0,,So what are the implications here? Does it onl...,Cody_Fox23,,4op948,d4eictg,0.000000,0.849655
1,,Sadly this isn't new. Police officers use many...,DrFrenchman,,4sef35,d58ts90,0.000000,0.900283
2,,What's disturbing about this is that our gover...,bbakks,,4sef35,d58y081,-0.038865,0.869078
3,,What I find really concerning is the horrible ...,poliscijunki,,4sef35,d5919n8,0.000000,0.898426
4,,This subject might have legs but this article ...,interweb1,,64zsim,dg6l969,0.000000,0.850127
...,...,...,...,...,...,...,...,...
11443,,"Yes, while in East Baghdad my platoons mission...",CapitalCockroach,,bav0rl,ekggrgk,1.070477,0.840028
11444,,The [definition the FBI currently uses for int...,CQME,,bav0rl,ekyelps,0.941533,0.882768
11445,,[Yes.](https://en.m.wikipedia.org/wiki/Islamic...,Silent_As_The_Grave_,,bav0rl,ekehcqg,0.217683,0.779386
11446,,Has ANY Shia ever committed an act of terroris...,bsmdphdjd,,bav0rl,ekfp4ls,1.293729,0.861529


In [41]:
random_topics_sample = topics_data.sample(n=3)

In [42]:
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
227,Ramirez goes deep twice as Indians beat Royals,3,hyf828,https://www.reuters.com/article/baseball-mlb-c...,6,1595828000.0,,jose ramirez hit two homers and drove in four ...
1364,What is the end-game of Oregon politicians den...,777,c4gs5j,https://www.reddit.com/r/NeutralPolitics/comme...,179,1561371000.0,So the long and short is that several Oregon l...,So the long and short is that several Oregon l...
948,Coronavirus cases in the Netherlands climb by ...,15,jfzn7g,https://www.reuters.com/article/health-coronav...,5,1603402000.0,,"AMSTERDAM, Oct 22 (Reuters) - The number of co..."


In [43]:
random_topics_sample['score'].mean()

265.0

In [57]:
# Filter to the topics that meet both thresholds first, then sample once.
# Resampling the whole table until all 10 rows happened to pass was too slow.
eligible_topics = topics_data[(topics_data['score'] > 100) & (topics_data['comms_num'] > 50)]
random_topics_sample = eligible_topics.sample(n=10)
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
1393,Is the Trump Administration's use of the Espio...,504,bsw7e1,https://www.reddit.com/r/NeutralPolitics/comme...,229,1558829000.0,"In 1971, the Supreme Court ruled in favor of t...","In 1971, the Supreme Court ruled in favor of t..."
321,Pelosi calls House back to act on Postal Servi...,435,ib4jlg,https://www.axios.com/pelosi-house-postal-serv...,75,1597655000.0,,the house of representatives will be called ba...
1076,What are the benefits and limitations of Trica...,334,ijmdny,https://www.reddit.com/r/NeutralPolitics/comme...,88,1598856000.0,Tricameralism is a system of governance that i...,Tricameralism is a system of governance that i...
1222,"What are the context, precedent, and legality ...",588,ejs1f3,https://www.reddit.com/r/NeutralPolitics/comme...,183,1578143000.0,"On January 3, an American military drone [dest...","On January 3, an American military drone destr..."
341,More than 70 former GOP national security offi...,489,ie0flp,https://www.businessinsider.com/former-gop-nat...,136,1598058000.0,,more than 70 former republican national securi...
637,West Virginia Lawmaker Among Those Who Stormed...,454,ksbo1v,https://www.nytimes.com/2021/01/06/us/derrick-...,104,1610049000.0,,A newly elected lawmaker from West Virginia wa...
1232,2019 UK General Election Megathread,636,e9q72i,https://www.reddit.com/r/NeutralPolitics/comme...,265,1576199000.0,**I HAVE THE CONFIDENCE TO CALL A CONSERVATIVE...,I HAVE THE CONFIDENCE TO CALL A CONSERVATIVE M...
1139,What kind of legal action (if any) can foreign...,643,gx7e6g,https://www.reddit.com/r/NeutralPolitics/comme...,75,1591402000.0,There have been quite a few countries that hav...,There have been quite a few countries that hav...
1146,Is there any legal precedence for/against a pr...,661,gs8i90,https://www.reddit.com/r/NeutralPolitics/comme...,554,1590709000.0,According to [multiple](https://www.reuters.co...,"According to multiple sources, President Trump..."
257,Two Women Charged in Attack on Wisconsin State...,193,i28yme,https://www.nytimes.com/2020/07/29/us/wisconsi...,60,1596387000.0,,"he parked his car, and noticed a line of demon..."


In [62]:
subIds = random_topics_sample['id'].unique().tolist()
subIds

['bsw7e1',
 'ib4jlg',
 'ijmdny',
 'ejs1f3',
 'ie0flp',
 'ksbo1v',
 'e9q72i',
 'gx7e6g',
 'gs8i90',
 'i28yme']

In [68]:
# Collect the comments for each sampled submission, keeping the order of subIds
frames = [comments_data[comments_data['submissionId'] == i] for i in subIds]
random_comments_sample = pd.concat(frames)

random_comments_sample

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
11179,,Assange is being charged under the Espionage A...,FoolishFellow,,bsw7e1,eoriu20,1.118448,0.805833
11180,,1917 was a politically charged year. I'd love ...,gousey,,bsw7e1,eos8fsh,0.532586,0.832981
11181,,"It's really, really important, IMHO, to keep i...",CQME,,bsw7e1,ep2mov6,0.286061,0.810016
11182,,1. Yes. Julian Assange is arguably just a fore...,StuffyGoose,,bsw7e1,eos0cdd,0.919574,0.791466
11183,,"If it is unconstitutional, should that not be ...",MAK-15,,bsw7e1,eosi7vw,0.000000,0.823070
...,...,...,...,...,...,...,...,...
3597,removecomment,There is nothing neutral to be said about any ...,39b5538c9d3e209dc3c2202cf7ec933c,remove,i28yme,5f273e6bee857d0009964dc2,0.000000,0.869611
3599,removecomment,The veil of neutrality is routinely abused by ...,39b5538c9d3e209dc3c2202cf7ec933c,remove,i28yme,5f276770ee857d0009964f82,0.050985,0.841870
3600,removecomment,How is this demeaning language?,c4f22a6532a4ca274c5027e4bc3d3d99,Memes,i28yme,5f2770cfee857d0009964fb9,0.000000,0.799987
3603,approvecomment,"And yet, there is a graveyard of selectively d...",39b5538c9d3e209dc3c2202cf7ec933c,confirm_ham,i28yme,5f2813b3ee857d00099653e0,0.000000,0.896561


In [69]:
random_comments_sample['submissionId'].unique()

array(['bsw7e1', 'ib4jlg', 'ijmdny', 'ejs1f3', 'ie0flp', 'ksbo1v',
       'e9q72i', 'gx7e6g', 'gs8i90', 'i28yme'], dtype=object)

In [70]:
random_topics_sample['id'].unique()

array(['bsw7e1', 'ib4jlg', 'ijmdny', 'ejs1f3', 'ie0flp', 'ksbo1v',
       'e9q72i', 'gx7e6g', 'gs8i90', 'i28yme'], dtype=object)

In [79]:
import numpy as np
random_comments_sample['tfidf'] = np.nan

In [82]:
#Stuck on trying to figure out the best way to add the tfidf term to the dataframe...seems like it will be wildly
#inefficient to loop through anything
random_comments_sample['tfidf'] = tfidf(term, document, corpus)
random_comments_sample

NameError: name 'term' is not defined
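
The cell above fails because `term`, `document`, and `corpus` are not defined per row. One possible way around both the error and the efficiency concern is to tokenise each article once, precompute how many articles contain each word, and then score every comment against the article it belongs to. The sketch below uses small made-up frames in place of `random_topics_sample` and `random_comments_sample`. It assumes that `submissionId` matches the topic `id`, that every comment has a matching article with non-empty text, and that a whitespace split is an acceptable tokeniser. The name `comment_tfidf` is hypothetical.

```python
from collections import Counter
import math

import pandas as pd

# Toy stand-ins for the sampled topics and comments (hypothetical rows)
topics = pd.DataFrame({
    'id': ['a1', 'b2'],
    'text': ['the senate will vote on the bill', 'the team won the game'],
})
comments = pd.DataFrame({
    'submissionId': ['a1', 'b2', 'a1'],
    'content': ['senate vote today', 'great game', 'the the'],
})

# Tokenise each article once and keep its word counts and length
doc_counts = {row.id: Counter(row.text.split()) for row in topics.itertuples()}
doc_lengths = {doc_id: sum(c.values()) for doc_id, c in doc_counts.items()}

# Document frequency: the number of articles that contain each word
doc_freq = Counter(word for c in doc_counts.values() for word in c)
n_docs = len(doc_counts)

def comment_tfidf(content, submission_id):
    """Sum tf * idf over the words of one comment, using the cached counts."""
    counts = doc_counts[submission_id]
    length = doc_lengths[submission_id]
    total = 0.0
    for word in content.split():
        if word in counts:  # words missing from the article contribute 0
            total += (counts[word] / length) * math.log10(n_docs / doc_freq[word])
    return total

comments['tfidf'] = [
    comment_tfidf(content, sub_id)
    for content, sub_id in zip(comments['content'], comments['submissionId'])
]
```

There is still a Python loop over comments, but each word lookup is a dictionary access rather than a fresh pass over the whole corpus, which is where the original `tf` and `idf` functions would spend most of their time. Rows whose `text` is missing or is a scrape failure (one topic above shows `error` as its text) would need to be dropped or handled before building `doc_counts`.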