# Term Frequency-Inverse Document Frequency

This feature compares the frequencies of n-grams in a comment with their frequencies in the article the comment refers to. We could also augment it with WordNet so that the frequencies of related words in both documents are compared.
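
Since the feature is described in terms of n-grams, here is a minimal sketch of how word n-grams could be extracted and counted. The helper name `ngram_counts`, the lowercasing, and the whitespace tokenization are assumptions for illustration, not something taken from the project code.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count word n-grams in `text` (hypothetical helper, whitespace tokens)."""
    tokens = text.lower().split()
    # Zipping n staggered views of the token list yields consecutive n-tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(' '.join(gram) for gram in grams)

counts = ngram_counts('this is a a sample', n=2)
print(counts.most_common())
```

With `n=1` this reduces to plain word counts, which is what the cells further down work with.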

The terms are the n-grams in the comment. Term frequency is the frequency of a given term in the document (the article) that the comment refers to. Inverse document frequency is based on how many of the scraped articles contain the term: the log of the total number of articles divided by the number of articles containing the term, so terms that show up in few articles get more weight. Computing this over every article could prove too slow to finish in a reasonable amount of time. If that is the case, we can compare against a small random set of other articles instead (for example, 10).
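
If the full corpus does turn out to be too slow, the fallback of comparing against a handful of random other articles could look roughly like this. `sample_background` and its parameters are hypothetical names, and the fixed seed is only there to make the sketch repeatable.

```python
import random

def sample_background(articles, target_index, k=10, seed=0):
    """Return the target article plus up to k randomly chosen other articles."""
    others = [a for i, a in enumerate(articles) if i != target_index]
    rng = random.Random(seed)
    picked = rng.sample(others, min(k, len(others)))
    return [articles[target_index]] + picked

# Placeholder strings standing in for scraped article text
articles = [f'article {i}' for i in range(50)]
mini_corpus = sample_background(articles, target_index=3, k=10)
print(len(mini_corpus))
```

The target article is kept in the reduced corpus so that every term taken from it appears in at least one document.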

First, a quick example of how tf-idf should work. The documents used and the results should match the example on the Wikipedia page for tf-idf, which can be found here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [8]:
from collections import Counter
import math

In [9]:
doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

In [10]:
def tf(term, document):
    term_arr = str.split(document)
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)

In [11]:
def idf(term, corpus):
    numerator = len(corpus)
    count = 0
    for doc in corpus:
        term_arr = str.split(doc)
        if term in term_arr:
            count += 1
    if count == 0:
        return 0
    return math.log10(numerator/count)


In [12]:
def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

In [13]:
tfidf('sample', doc1, corpus)

0.06020599913279624
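
As a further sanity check, the same definitions can be applied to the second document. The functions are repeated in condensed form so the snippet runs on its own. The expected values follow from the definitions: 'example' has tf = 3/7 in doc2 and idf = log10(2/1), while 'this' appears in both documents and so has idf = log10(2/2) = 0.

```python
from collections import Counter
import math

def tf(term, document):
    words = document.split()
    return Counter(words)[term] / len(words)

def idf(term, corpus):
    # Number of documents that contain the term at least once
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log10(len(corpus) / containing) if containing else 0

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

# 'example' occurs 3 times in the 7 words of doc2 and only in doc2
print(round(tfidf('example', doc2, corpus), 3))   # 0.129
# 'this' occurs in every document, so its idf (and tf-idf) is 0
print(tfidf('this', doc1, corpus))                # 0.0
```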

I think this exact code can be used on the actual reddit data. Each word in a comment is a term: loop through the words, compute the tfidf score of each one, and sum the scores to get the comment's tfidf score. The document is the plain article text (need to look at what Sam did for the word score comparisons to grab article text), and the corpus is just a list of those document strings (which, again, are plain text).
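
The plan of scoring every word in the comment and summing could be sketched as follows. `comment_score` is a hypothetical name, and the tf/idf helpers are condensed copies of the earlier cells so the snippet is self-contained.

```python
from collections import Counter
import math

def tf(term, document):
    words = document.split()
    return Counter(words)[term] / len(words)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log10(len(corpus) / containing) if containing else 0

def comment_score(comment, document, corpus):
    """Sum the tf-idf of each whitespace-separated word in `comment`."""
    return sum(tf(word, document) * idf(word, corpus) for word in comment.split())

article = 'this is a a sample'
corpus = [article, 'this is another another example example example']
print(comment_score('a sample', article, corpus))
```

Note that a word repeated in the comment is counted once per occurrence, so longer comments tend to get larger sums.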

In [14]:
import pandas as pd
topics_data = pd.read_csv('files/compiled_topics.csv')
comments_data = pd.read_csv('files/compiled_comments.csv')

In [15]:
random_topics_sample = topics_data.sample(n=3)

In [16]:
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
379,Democratic candidates for Baltimore mayor spen...,77,ijxzxa,https://apnews.com/24fada4cc4c1ebd8370f60caa74...,38,1598909000.0,,baltimore (ap) — the top six democratic candid...
1356,Are immigrants in US detention centers free to...,781,c8tsin,https://www.reddit.com/r/NeutralPolitics/comme...,339,1562217000.0,Conditions are very poor in U.S. immigration d...,Conditions are very poor in U.S. immigration d...
724,Attorney General William Barr to depart admini...,53,kd9801,https://www.nbcnews.com/politics/politics-news...,17,1608017000.0,,Attorney General William Barr will leave his p...


In [17]:
random_topics_sample['score'].mean()

303.6666666666667

In [18]:
#This loop can take a while because it resamples until all 10 topics pass both thresholds. It might be better to test the mean
#instead of the min (already seen to run faster), lower the thresholds, or filter on the conditions before sampling instead of resampling every time
while True:
    random_topics_sample = topics_data.sample(n=10)
    if ((random_topics_sample['score'].min() > 100) and (random_topics_sample['comms_num'].min() > 50)):
        break
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
1318,Does the federal government possess the power ...,338,cr5yg0,https://www.reddit.com/r/NeutralPolitics/comme...,374,1565989000.0,"Yesterday, [Beto O'rourke](https://www.huffpos...","Yesterday, Beto O'rourke came out in favor of ..."
1275,What are the pros and cons of the US withdrawi...,505,dfxd42,https://www.reddit.com/r/NeutralPolitics/comme...,92,1570740000.0,What is the Open Skies Treaty? The treaty allo...,What is the Open Skies Treaty? The treaty allo...
233,How Local Covid Deaths Are Affecting Vote Choice,155,hzevob,https://www.nytimes.com/2020/07/28/upshot/poll...,56,1595973000.0,,to understand whether these community-level ex...
75,Hannity Says It’s ‘Despicable’ to Call for Pol...,479,bxtj3y,https://www.thedailybeast.com/hannity-says-its...,106,1559936000.0,,"apparently, “lock her up!” never happened.\n\n..."
70,Durbin: McConnell ignored election security be...,133,blxd09,https://www.politico.com/story/2019/05/07/dick...,74,1557298000.0,,"""he ignores the mueller report and our intelli..."
1313,Can the President Order Firms to Leave China?,557,cvbghy,https://www.reddit.com/r/NeutralPolitics/comme...,187,1566783000.0,US President Donald Trump [tweeted](https://tw...,US President Donald Trump tweeted on Friday th...
1004,What is the evidence supporting and refuting t...,838,kaubu5,https://www.reddit.com/r/NeutralPolitics/comme...,460,1607685000.0,[Donald Trump has requested](https://www.cnn.c...,Donald Trump has requested the Supreme Court o...
79,Donald Trump Jr. shared a racist tweet about K...,301,c75b2i,https://www.businessinsider.com/donald-trump-j...,89,1561875000.0,[deleted],"donald trump jr., the president's eldest son a..."
147,Supreme Court Rules State 'Faithless Elector' ...,212,hm8qvv,https://www.npr.org/2020/07/06/885168480/supre...,77,1594075000.0,,supreme court rules state 'faithless elector' ...
209,Trump consults Bush torture lawyer on how to s...,294,hv6rn4,https://www.theguardian.com/us-news/2020/jul/2...,69,1595365000.0,,the trump administration has been consulting t...
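
One way to avoid the retry loop is to filter on the conditions first and then sample once. This is a sketch rather than the notebook's actual code: the small frame below is made-up data standing in for `topics_data`, and the thresholds are the ones used in the loop.

```python
import pandas as pd

# Made-up stand-in for topics_data; only the columns used by the filter matter
topics_data = pd.DataFrame({
    'id': [f't{i}' for i in range(40)],
    'score': [i * 10 for i in range(40)],
    'comms_num': [i * 5 for i in range(40)],
})

# Keep only rows that satisfy both conditions, then draw the sample once
eligible = topics_data[(topics_data['score'] > 100) & (topics_data['comms_num'] > 50)]
random_topics_sample = eligible.sample(n=min(10, len(eligible)), random_state=0)
print(len(random_topics_sample))
```

Every sample the loop would accept is made up entirely of eligible rows, so sampling from the filtered frame draws from the same pool without the repeated attempts.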


In [19]:
subIds = random_topics_sample['id'].unique().tolist()
subIds

['cr5yg0',
 'dfxd42',
 'hzevob',
 'bxtj3y',
 'blxd09',
 'cvbghy',
 'kaubu5',
 'c75b2i',
 'hm8qvv',
 'hv6rn4']

In [21]:
#Collect the comments for each sampled submission, in the same order as subIds
frames = [comments_data[comments_data['submissionId'] == i] for i in subIds]
random_comments_sample = pd.concat(frames)

random_comments_sample

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
10734,,[/r/NeutralPolitics](https://www.reddit.com/r/...,tkc80,,cr5yg0,ex37105,0.109078,0.771502
10735,,"I'm going to assume that by ""assault weapon"" y...",,,cr5yg0,ex3d7io,1.120470,0.839213
10736,,"u/Chasicle and others have asked ""What is an a...",prime_23571113,,cr5yg0,ex3ki3m,1.218901,0.863967
10737,,"Robert Francis ""Beto"" O'Rourke and other Democ...",no112358,,cr5yg0,ex5ps5h,0.826000,0.778125
10738,,The biggest problem you're going to have with ...,tklite,,cr5yg0,ex3c04q,0.437754,0.861470
...,...,...,...,...,...,...,...,...
2843,removecomment,If people would start actually paying attentio...,dbb507dc2ef55cb83730e2a5e44a4805,remove,hv6rn4,5f1782be4af4320009455033,1.190177,0.884120
2844,approvecomment,I wouldn’t say it’s what aboutism. I would...,c5fdabdbfaa7e7a31c48eb11cd31c1eb,confirm_ham,hv6rn4,5f1785084af432000945503d,0.000611,0.868083
2875,removecomment,l,ae02c4ee90ec3927e8d113655b927e94,One-word response,hv6rn4,5f1876704af43200094558c2,0.000000,0.863451
3673,approvecomment,"And yet, these are the same people that preten...",e12d2abe6ac18532fc24fd610695a31c,confirm_ham,hv6rn4,5f2ae7eeee857d0009966cac,0.350250,0.891546


In [22]:
random_comments_sample['submissionId'].unique()

array(['cr5yg0', 'dfxd42', 'hzevob', 'bxtj3y', 'blxd09', 'cvbghy',
       'kaubu5', 'c75b2i', 'hm8qvv', 'hv6rn4'], dtype=object)

In [23]:
random_topics_sample['id'].unique()

array(['cr5yg0', 'dfxd42', 'hzevob', 'bxtj3y', 'blxd09', 'cvbghy',
       'kaubu5', 'c75b2i', 'hm8qvv', 'hv6rn4'], dtype=object)

In [25]:
random_comments_sample.to_csv("files/random_comments_sample216.csv", index=False)
random_topics_sample.to_csv("files/random_topics_sample216.csv", index=False)

### To avoid having to read in more data, feel free to start running code from here onward
My plan is to first run the program on this small sample to see if it works, then pull in the full dataframe and see how that goes.

In [57]:
random_comments_sample = pd.read_csv("files/random_comments_sample216.csv")
random_topics_sample = pd.read_csv("files/random_topics_sample216.csv")

import numpy as np
random_comments_sample['tfidf'] = np.nan

In [58]:
#Remember that sometimes the url is just a link back to the main reddit thread. In that case tfidf compares the comments
#against whatever the original poster wrote, not against an article
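
A rough way to flag those cases is to check whether the url points back to reddit. This is only a heuristic sketch: `is_self_post` is a hypothetical helper, and it assumes that a reddit.com host always means the topic is a text post rather than a linked article.

```python
from urllib.parse import urlparse

def is_self_post(url):
    """Heuristic: treat links whose host is reddit.com (or a subdomain) as self posts."""
    host = urlparse(url).netloc.lower()
    return host == 'reddit.com' or host.endswith('.reddit.com')

print(is_self_post('https://www.reddit.com/r/NeutralPolitics/'))
print(is_self_post('https://apnews.com/'))
```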

In [66]:
#Stuck on trying to figure out the best way to add the tfidf column to the dataframe...seems like it will be wildly
#inefficient to loop through anything
#The corpus is the text of every sampled topic
corpus = random_topics_sample['text'].tolist()

tfidf_list = []

for index, comment in random_comments_sample.iterrows():
    #Text of the topic this comment belongs to
    #Caution: str() on a Series gives its printed form (index, possibly shortened text, dtype footer), not just the article text
    doc = random_topics_sample[random_topics_sample['id'] == comment['submissionId']]['text']
    doc = str(doc)
    #Sum the tfidf score of every whitespace-separated word in the comment
    tfidf_sum = 0
    for element in str.split(comment['content']):
        tfidf_sum += tfidf(element, doc, corpus)
    tfidf_list.append(tfidf_sum)

random_comments_sample['tfidf'] = tfidf_list

##### The above code works for the random sample of comments, but it already takes a bit of time. Now I'm going to try to make it work on the full compiled comments and compiled topics dataframes, and turn it into a function

In [81]:
import numpy as np
import pandas as pd
from collections import Counter
import math

def tf(term, document):
    document = str(document)
    term_arr = str.split(document)
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)

def idf(term, corpus):
    numerator = len(corpus)
    count = 0
    for doc in corpus:
        doc = str(doc)
        term_arr = str.split(doc)
        if term in term_arr:
            count += 1
    if count == 0:
        return 0
    return math.log10(numerator/count)

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

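def tfidf_on_dataset(topics_df, comments_df):
    """Return one summed tfidf score per row of comments_df."""
    #The corpus is the text of every topic
    corpus = topics_df['text'].tolist()

    tfidf_list = []

    for index, comment in comments_df.iterrows():
        #Text of the topic this comment belongs to
        #Caution: str() on a Series gives its printed form (index, possibly shortened text, dtype footer), not just the article text
        doc = topics_df[topics_df['id'] == comment['submissionId']]['text']
        doc = str(doc)
        #str() guards against missing comments, which pandas loads as float NaN
        term_arr = str.split(str(comment['content']))
        #Sum the tfidf score of every whitespace-separated word in the comment
        tfidf_sum = 0
        for element in term_arr:
            tfidf_sum += tfidf(element, doc, corpus)
        tfidf_list.append(tfidf_sum)
        #Progress report; assumes the index is the default 0..n-1 range
        print(str((index/(len(comments_df.index)))*100) + "% complete")
    return tfidf_list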
    
random_comments_sample['tfidf'] = tfidf_on_dataset(random_topics_sample, random_comments_sample)
random_comments_sample

0.0% complete
0.684931506849315% complete
1.36986301369863% complete
...

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore,tfidf
0,,[/r/NeutralPolitics](https://www.reddit.com/r/...,tkc80,,cr5yg0,ex37105,0.109078,0.771502,0.003268
1,,"I'm going to assume that by ""assault weapon"" y...",,,cr5yg0,ex3d7io,1.120470,0.839213,0.031693
2,,"u/Chasicle and others have asked ""What is an a...",prime_23571113,,cr5yg0,ex3ki3m,1.218901,0.863967,0.083987
3,,"Robert Francis ""Beto"" O'Rourke and other Democ...",no112358,,cr5yg0,ex5ps5h,0.826000,0.778125,0.009805
4,,The biggest problem you're going to have with ...,tklite,,cr5yg0,ex3c04q,0.437754,0.861470,0.009805
...,...,...,...,...,...,...,...,...,...
141,removecomment,If people would start actually paying attentio...,dbb507dc2ef55cb83730e2a5e44a4805,remove,hv6rn4,5f1782be4af4320009455033,1.190177,0.884120,0.012908
142,approvecomment,I wouldn’t say it’s what aboutism. I would...,c5fdabdbfaa7e7a31c48eb11cd31c1eb,confirm_ham,hv6rn4,5f1785084af432000945503d,0.000611,0.868083,0.000000
143,removecomment,l,ae02c4ee90ec3927e8d113655b927e94,One-word response,hv6rn4,5f1876704af43200094558c2,0.000000,0.863451,0.000000
144,approvecomment,"And yet, these are the same people that preten...",e12d2abe6ac18532fc24fd610695a31c,confirm_ham,hv6rn4,5f2ae7eeee857d0009966cac,0.350250,0.891546,0.000000
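
Two things in `tfidf_on_dataset` are worth a second look. First, `str(doc)` is applied to a pandas Series, which gives the Series' printed form (index, possibly shortened text, and a `Name: ..., dtype: ...` footer) rather than the article text. Second, `idf` re-splits every article for every word of every comment. Below is a sketch of a variant that addresses both by tokenizing each article once and precomputing document frequencies. It works on plain Python containers so it can run on its own; `score_comments` and the shapes of its arguments are assumptions, not part of the notebook.

```python
from collections import Counter
import math

def score_comments(topic_texts, comments):
    """topic_texts: {submission id: article text}; comments: [(submission id, comment text)]."""
    # Tokenize each article once and count its words
    counts = {tid: Counter(str(text).split()) for tid, text in topic_texts.items()}
    lengths = {tid: sum(c.values()) for tid, c in counts.items()}
    # Document frequency: in how many articles does each word appear?
    doc_freq = Counter(word for c in counts.values() for word in c)
    n_docs = len(counts)

    scores = []
    for tid, content in comments:
        total = 0.0
        # Comments whose article is missing or empty keep a score of 0
        if lengths.get(tid):
            for word in str(content).split():
                df = doc_freq.get(word, 0)
                if df:
                    total += counts[tid][word] / lengths[tid] * math.log10(n_docs / df)
        scores.append(total)
    return scores

topics = {'a1': 'this is a a sample',
          'a2': 'this is another another example example example'}
comments = [('a1', 'a sample'), ('a2', 'this example'), ('zz', 'orphan comment')]
print(score_comments(topics, comments))
```

With the notebook's dataframes, the inputs could be built with something like `dict(zip(topics_df['id'], topics_df['text']))` and `zip(comments_df['submissionId'], comments_df['content'])`. Because this version scores against the article text itself, its numbers would not be expected to match the `tfidf` column in the table above.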


#### The above code will work on a full dataset...but I'm not sure how long it would take to run, so let's try it

In [82]:
full_topics_df = pd.read_csv("files/compiled_topics.csv")
full_comments_df = pd.read_csv("files/compiled_comments.csv")

In [83]:
full_comments_df['tfidf'] = tfidf_on_dataset(full_topics_df, full_comments_df)
full_comments_df

0.0% complete
0.008735150244584206% complete
0.017470300489168412% complete
...
15.435010482180294% complete

TypeError: descriptor 'split' requires a 'str' object but received a 'float'

4:40 start time for above
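
The `TypeError` above is what `str.split` raises when it is handed a float instead of a string. A likely source is a missing value: pandas reads empty CSV cells as `NaN`, which is a float. A minimal guard, with `safe_tokens` as a hypothetical helper name, could look like this:

```python
def safe_tokens(value):
    """Split text on whitespace; treat NaN, None, and other non-strings as empty text."""
    if not isinstance(value, str):
        return []
    return value.split()

print(safe_tokens('this is a sample'))
print(safe_tokens(float('nan')))
```

Another option would be to fill the missing values before scoring, for example with `full_topics_df['text'].fillna('')`.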