# Term Frequency-Inverse Document Frequency

This feature compares the frequencies of the n-grams used in a comment with those used in the article. We could also augment it with WordNet to compare the frequencies of related words in both documents.
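As a minimal sketch of what "n-grams" means here, assuming plain whitespace tokenization (the helper name `ngrams` is hypothetical and is not used elsewhere in this notebook):

```python
def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as tuples of words."""
    words = text.split()
    # Slide a window of n words across the text; shorter texts yield no n-grams
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams('this is a a sample', 2)
print(bigrams)
```

With `n=1` this reduces to single words, which is what the rest of the notebook works with.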

A term is an n-gram from the comment. Term frequency is the frequency of that term in the document (the article) that the comment refers to. Inverse document frequency is derived from the number of scraped articles that contain the term: it is the log of the total number of articles divided by the number of articles containing the term, so terms that are rare across the corpus get more weight. Computing this over every article may take too long. If so, we can compare against a smaller random sample of articles instead (10, for example).
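For reference, the quantities above can be written as follows. The symbols are my own notation, not the author's: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $D$ is the corpus, and $N$ is the number of documents in $D$. The base-10 log is an assumption chosen to agree with the `math.log10` call used in this notebook.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t,D) = \log_{10}\frac{N}{\lvert\{d \in D : t \in d\}\rvert}, \qquad
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
$$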

First, a quick example of how tf-idf should work. The documents used and the results obtained should match the example on the Wikipedia page for tf-idf, which can be found here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [2]:
from collections import Counter
import math

In [9]:
doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

In [10]:
def tf(term, document):
    # Term frequency: occurrences of the term divided by the total number of words in the document
    term_arr = str.split(document)
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)

In [11]:
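def idf(term, corpus):
    # Inverse document frequency: log10 of (number of documents / number of documents containing the term)
    numerator = len(corpus)
    count = 0
    for doc in corpus:
        term_arr = str.split(doc)
        if term in term_arr:
            count += 1
    # A term found in no document gets 0 rather than a division by zero
    if count == 0:
        return 0
    return math.log10(numerator/count)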


In [12]:
def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

In [13]:
tfidf('sample', doc1, corpus)

0.06020599913279624
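As a further sanity check, here is a self-contained sketch that repeats the same definitions and evaluates two more terms from the toy corpus. The function bodies are condensed rewrites of the cells above, not new logic. A term that appears in every document should score 0, and a term concentrated in one document should score highest.

```python
import math
from collections import Counter

doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

def tf(term, document):
    words = document.split()
    return Counter(words)[term] / len(words)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log10(len(corpus) / containing) if containing else 0

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# 'this' appears in both documents, so idf = log10(2/2) = 0 and the score vanishes
print(tfidf('this', doc1, corpus))
# 'example' fills 3 of the 7 words in doc2 and appears in no other document
print(round(tfidf('example', doc2, corpus), 3))
```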

I think the same code can be used on the actual reddit data. Each word in a comment is a term: loop through the words, compute the tf-idf score of each one, and sum the scores to get the comment's tf-idf score. The document is the plain article text (need to look at what Sam did for the word score comparisons to grab article text), and the corpus is just a list of those documents (which, again, are plain text).

In [14]:
import pandas as pd
topics_data = pd.read_csv('files/compiled_topics.csv')
comments_data = pd.read_csv('files/compiled_comments.csv')

In [15]:
random_topics_sample = topics_data.sample(n=3)

In [16]:
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
379,Democratic candidates for Baltimore mayor spen...,77,ijxzxa,https://apnews.com/24fada4cc4c1ebd8370f60caa74...,38,1598909000.0,,baltimore (ap) — the top six democratic candid...
1356,Are immigrants in US detention centers free to...,781,c8tsin,https://www.reddit.com/r/NeutralPolitics/comme...,339,1562217000.0,Conditions are very poor in U.S. immigration d...,Conditions are very poor in U.S. immigration d...
724,Attorney General William Barr to depart admini...,53,kd9801,https://www.nbcnews.com/politics/politics-news...,17,1608017000.0,,Attorney General William Barr will leave his p...


In [17]:
random_topics_sample['score'].mean()

303.6666666666667

In [18]:
#This loop takes a bit too long. It might be better to test the mean instead of the min (which has already been seen to run faster),
#lower the minimum requirements, or find a better way to sample with conditions instead of randomly resampling every time
while True:
    random_topics_sample = topics_data.sample(n=10)
    if ((random_topics_sample['score'].min() > 100) and (random_topics_sample['comms_num'].min() > 50)):
        break
random_topics_sample

Unnamed: 0,title,score,id,url,comms_num,created,body,text
1318,Does the federal government possess the power ...,338,cr5yg0,https://www.reddit.com/r/NeutralPolitics/comme...,374,1565989000.0,"Yesterday, [Beto O'rourke](https://www.huffpos...","Yesterday, Beto O'rourke came out in favor of ..."
1275,What are the pros and cons of the US withdrawi...,505,dfxd42,https://www.reddit.com/r/NeutralPolitics/comme...,92,1570740000.0,What is the Open Skies Treaty? The treaty allo...,What is the Open Skies Treaty? The treaty allo...
233,How Local Covid Deaths Are Affecting Vote Choice,155,hzevob,https://www.nytimes.com/2020/07/28/upshot/poll...,56,1595973000.0,,to understand whether these community-level ex...
75,Hannity Says It’s ‘Despicable’ to Call for Pol...,479,bxtj3y,https://www.thedailybeast.com/hannity-says-its...,106,1559936000.0,,"apparently, “lock her up!” never happened.\n\n..."
70,Durbin: McConnell ignored election security be...,133,blxd09,https://www.politico.com/story/2019/05/07/dick...,74,1557298000.0,,"""he ignores the mueller report and our intelli..."
1313,Can the President Order Firms to Leave China?,557,cvbghy,https://www.reddit.com/r/NeutralPolitics/comme...,187,1566783000.0,US President Donald Trump [tweeted](https://tw...,US President Donald Trump tweeted on Friday th...
1004,What is the evidence supporting and refuting t...,838,kaubu5,https://www.reddit.com/r/NeutralPolitics/comme...,460,1607685000.0,[Donald Trump has requested](https://www.cnn.c...,Donald Trump has requested the Supreme Court o...
79,Donald Trump Jr. shared a racist tweet about K...,301,c75b2i,https://www.businessinsider.com/donald-trump-j...,89,1561875000.0,[deleted],"donald trump jr., the president's eldest son a..."
147,Supreme Court Rules State 'Faithless Elector' ...,212,hm8qvv,https://www.npr.org/2020/07/06/885168480/supre...,77,1594075000.0,,supreme court rules state 'faithless elector' ...
209,Trump consults Bush torture lawyer on how to s...,294,hv6rn4,https://www.theguardian.com/us-news/2020/jul/2...,69,1595365000.0,,the trump administration has been consulting t...
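One possible way to avoid the repeated resampling is to filter on the conditions first and then sample once from the rows that pass. This is only a sketch: the small frame below is a hypothetical stand-in for `topics_data`, and the helper name `sample_eligible` and its parameters are my own.

```python
import pandas as pd

# Hypothetical stand-in for topics_data with only the columns the filter needs
topics = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'score': [50, 150, 300, 99, 500, 120],
    'comms_num': [10, 60, 80, 200, 55, 40],
})

def sample_eligible(topics_df, n, min_score=100, min_comments=50, seed=None):
    """Sample up to n topics whose score and comment count both exceed the thresholds."""
    eligible = topics_df[(topics_df['score'] > min_score)
                         & (topics_df['comms_num'] > min_comments)]
    # Never ask for more rows than exist, otherwise DataFrame.sample raises ValueError
    return eligible.sample(n=min(n, len(eligible)), random_state=seed)

sample = sample_eligible(topics, n=2, seed=0)
print(sample)
```

Sampling uniformly from the filtered rows gives the same distribution as redrawing until every row passes, but it takes a single pass over the data.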


In [19]:
subIds = random_topics_sample['id'].unique().tolist()
subIds

['cr5yg0',
 'dfxd42',
 'hzevob',
 'bxtj3y',
 'blxd09',
 'cvbghy',
 'kaubu5',
 'c75b2i',
 'hm8qvv',
 'hv6rn4']

In [21]:
random_comments_sample = pd.DataFrame(columns = comments_data.columns)
for i in subIds:
    frame = comments_data[comments_data['submissionId'] == i]
    frames = [random_comments_sample, frame]
    random_comments_sample = pd.concat(frames)

random_comments_sample

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore
10734,,[/r/NeutralPolitics](https://www.reddit.com/r/...,tkc80,,cr5yg0,ex37105,0.109078,0.771502
10735,,"I'm going to assume that by ""assault weapon"" y...",,,cr5yg0,ex3d7io,1.120470,0.839213
10736,,"u/Chasicle and others have asked ""What is an a...",prime_23571113,,cr5yg0,ex3ki3m,1.218901,0.863967
10737,,"Robert Francis ""Beto"" O'Rourke and other Democ...",no112358,,cr5yg0,ex5ps5h,0.826000,0.778125
10738,,The biggest problem you're going to have with ...,tklite,,cr5yg0,ex3c04q,0.437754,0.861470
...,...,...,...,...,...,...,...,...
2843,removecomment,If people would start actually paying attentio...,dbb507dc2ef55cb83730e2a5e44a4805,remove,hv6rn4,5f1782be4af4320009455033,1.190177,0.884120
2844,approvecomment,I wouldn’t say it’s what aboutism. I would...,c5fdabdbfaa7e7a31c48eb11cd31c1eb,confirm_ham,hv6rn4,5f1785084af432000945503d,0.000611,0.868083
2875,removecomment,l,ae02c4ee90ec3927e8d113655b927e94,One-word response,hv6rn4,5f1876704af43200094558c2,0.000000,0.863451
3673,approvecomment,"And yet, these are the same people that preten...",e12d2abe6ac18532fc24fd610695a31c,confirm_ham,hv6rn4,5f2ae7eeee857d0009966cac,0.350250,0.891546
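The same selection can probably be done without the loop by using `Series.isin`. The frame below is a hypothetical miniature of `comments_data`. Note that the rows come back in the order of the comments frame rather than grouped in the order of the id list.

```python
import pandas as pd

# Hypothetical miniature of comments_data
comments = pd.DataFrame({
    'submissionId': ['x1', 'x2', 'x3', 'x1', 'x4'],
    'content': ['first', 'second', 'third', 'fourth', 'fifth'],
})
wanted_ids = ['x1', 'x3']

# Keep every comment whose submissionId is in the list, in one vectorized pass
selected = comments[comments['submissionId'].isin(wanted_ids)]
print(selected)
```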


In [22]:
random_comments_sample['submissionId'].unique()

array(['cr5yg0', 'dfxd42', 'hzevob', 'bxtj3y', 'blxd09', 'cvbghy',
       'kaubu5', 'c75b2i', 'hm8qvv', 'hv6rn4'], dtype=object)

In [23]:
random_topics_sample['id'].unique()

array(['cr5yg0', 'dfxd42', 'hzevob', 'bxtj3y', 'blxd09', 'cvbghy',
       'kaubu5', 'c75b2i', 'hm8qvv', 'hv6rn4'], dtype=object)

In [25]:
random_comments_sample.to_csv("files/random_comments_sample216.csv", index=False)
random_topics_sample.to_csv("files/random_topics_sample216.csv", index=False)

### To avoid having to read in more data, feel free to start running code from here onward
My plan is to first run the program on this small sample to see if it works, and then pull in the large dataframe and see how that goes.

In [57]:
random_comments_sample = pd.read_csv("files/random_comments_sample216.csv")
random_topics_sample = pd.read_csv("files/random_topics_sample216.csv")

import numpy as np
random_comments_sample['tfidf'] = np.nan

In [58]:
#Remember that sometimes the url is just a link back to the main reddit thread; in that case we will be
#performing tfidf on whatever the first user posted against the comments, not on an article
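A rough way to tell the two cases apart is to look at the host of the url. This is a sketch under the assumption that self posts always link to a `reddit.com` host. The helper name `is_reddit_thread` is hypothetical, and other hosts that may point back to reddit are not handled.

```python
from urllib.parse import urlparse

def is_reddit_thread(url):
    """Return True when the url points back to reddit rather than to an external article."""
    host = urlparse(url).netloc.lower()
    return host == 'reddit.com' or host.endswith('.reddit.com')

print(is_reddit_thread('https://www.reddit.com/r/NeutralPolitics/'))
print(is_reddit_thread('https://apnews.com/'))
```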

In [66]:
#Stuck on trying to figure out the best way to add the tfidf term to the dataframe...seems like it will be wildly
#inefficient to loop through anything
corpus = []
id_list = []
for index, topic in random_topics_sample.iterrows():
    corpus.append(topic['text'])
    id_list.append(topic['id'])
    
tfidf_list = []
    
for index, comment in random_comments_sample.iterrows():
    tfidf_sum = 0
    tfidf_current = 0
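    #Take the matching topic's text itself; calling str() on the whole Series would score its truncated printout instead
    doc = random_topics_sample[random_topics_sample['id'] == comment['submissionId']]['text'].iloc[0]
    doc = str(doc)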
    term = comment['content']
    term_arr = str.split(term)
    for element in term_arr:
        tfidf_current = tfidf(element, doc, corpus)
        tfidf_sum += tfidf_current
    tfidf_list.append(tfidf_sum)
    
    
random_comments_sample['tfidf'] = tfidf_list

##### The above code works for the random sampling of comments but it already takes a bit of time. Now I'm going to try to make it work on the full compiled comments and compiled topics data frame as well as turn it into a function
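Most of the time is likely spent re-splitting every document for every word of every comment. A possible speed-up, sketched here with hypothetical helper names (`build_index`, `comment_score`), is to tokenize each document once, store the per-document word counts and the document frequencies, and then score comments with dictionary lookups. The scoring rule is meant to mirror the one above: sum tf * idf over each word of the comment, with a score of 0 for words found in no document.

```python
import math
from collections import Counter

def build_index(corpus):
    """Tokenize each document once; return per-document word counts and document frequencies."""
    doc_counts = [Counter(str(doc).split()) for doc in corpus]
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())  # each document adds at most 1 per term
    return doc_counts, doc_freq

def comment_score(comment, doc_index, doc_counts, doc_freq):
    """Sum tf * idf over every word of the comment, against one document of the corpus."""
    counts = doc_counts[doc_index]
    doc_len = sum(counts.values())
    if doc_len == 0:
        return 0.0
    n_docs = len(doc_counts)
    total = 0.0
    for word in str(comment).split():
        if doc_freq[word]:  # words found in no document contribute 0
            total += (counts[word] / doc_len) * math.log10(n_docs / doc_freq[word])
    return total

corpus = ['this is a a sample', 'this is another another example example example']
doc_counts, doc_freq = build_index(corpus)
print(comment_score('a sample', 0, doc_counts, doc_freq))
```

The index is built once per corpus, so the cost per comment depends on the length of the comment rather than on the size of the corpus.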

In [18]:
import numpy as np
import pandas as pd
from collections import Counter
import math

def tf(term, document):
    document = str(document)
    term_arr = str.split(document)
    term_dict = Counter(term_arr)
    return term_dict[term] / len(term_arr)

def idf(term, corpus):
    numerator = len(corpus)
    count = 0
    for doc in corpus:
        doc = str(doc)
        term_arr = str.split(doc)
        if term in term_arr:
            count += 1
    if count == 0:
        return 0
    return math.log10(numerator/count)

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

def tfidf_on_dataset(topics_df, comments_df):
    corpus = []
    id_list = []
    for index, topic in topics_df.iterrows():
        corpus.append(topic['text'])
        id_list.append(topic['id'])
    
    tfidf_list = []
    
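    for index, comment in comments_df.iterrows():
        tfidf_sum = 0
        tfidf_current = 0
        #Take the matching topic's text itself; str() on the whole Series would score its truncated printout
        matches = topics_df[topics_df['id'] == comment['submissionId']]['text']
        doc = str(matches.iloc[0]) if not matches.empty else ''
        if not doc.split():
            #No topic text to score against (missing topic or empty text); assume a score of 0
            tfidf_list.append(0)
            continue
        term = comment['content']
        term = str(term)
        term_arr = str.split(term)
        for element in term_arr:
            tfidf_current = tfidf(element, doc, corpus)
            tfidf_sum += tfidf_current
        tfidf_list.append(tfidf_sum)
        print(str((index/(len(comments_df.index)))*100) + "% complete")
    return tfidf_list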

#### The above code will work on a full dataset...but I'm not sure how long it would take to run, so let's try it
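Before committing to a full run, one option is to time the scoring on a small sample and scale the result up. The sketch below assumes the cost grows roughly linearly with the number of comments. The names `estimate_total_seconds` and `fake_score` and the counts used are hypothetical placeholders, not values from the dataset.

```python
import time

def estimate_total_seconds(score_one, sample_items, total_count):
    """Time score_one over a small sample and scale linearly to total_count items."""
    start = time.perf_counter()
    for item in sample_items:
        score_one(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample_items) * total_count

# Hypothetical stand-in for scoring one comment; swap in the real per-comment call
def fake_score(comment):
    return sum(len(word) for word in comment.split())

sample = ['some comment text'] * 100
estimate = estimate_total_seconds(fake_score, sample, total_count=10000)
print(estimate)
```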

In [19]:
full_topics_df = pd.read_csv("files/compiled_topics.csv")
full_comments_df = pd.read_csv("files/compiled_comments_3_14_2021.csv")
full_comments_df = full_comments_df.drop(['Unnamed: 0'], axis=1)
full_comments_df

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore,tfidf,contains_url,...,contains_!,no_url_WordScore,no_url_WholeScore,WordScoreNoStop,WholeScoreNoStop,no_url_or_stops_WholeScore,no_url_or_stops_WordScore,no_url_or_stops_content,NER_count,NER_match
0,True,So what are the implications here? Does it onl...,Cody_Fox23,,4op948,d4eictg,0.000000,0.849655,0.001573,False,...,False,0.000000,0.816813,0.000000,0.773069,0.736582,0.000000,So implications here? Does affect involved Vis...,0,0
1,True,Sadly this isn't new. Police officers use many...,DrFrenchman,,4sef35,d58ts90,0.000000,0.900283,0.255802,False,...,True,0.000000,0.884829,0.000000,0.857654,0.844658,0.000000,Sadly isn't new. Police officers use faulty te...,0,0
2,True,What's disturbing about this is that our gover...,bbakks,,4sef35,d58y081,-0.038865,0.869078,0.000000,False,...,False,-0.038865,0.866455,-0.038865,0.833865,0.785302,-0.038865,What's disturbing government destroying lives ...,1,0
3,True,What I find really concerning is the horrible ...,poliscijunki,,4sef35,d5919n8,0.000000,0.898426,0.000000,True,...,False,0.000000,0.884435,0.000000,0.865826,0.852412,0.000000,What I find concerning horrible response law e...,1,0
4,True,This subject might have legs but this article ...,interweb1,,64zsim,dg6l969,0.000000,0.850127,0.000000,False,...,False,0.000000,0.835723,0.000000,0.826162,0.804306,0.000000,This subject legs article opinion piece editor...,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10231,True,"Yes, while in East Baghdad my platoons mission...",CapitalCockroach,,bav0rl,ekggrgk,1.070477,0.840028,0.000000,True,...,False,1.070477,0.831097,1.000532,0.827872,0.788655,1.000532,"Yes, East Baghdad platoons mission check build...",3,0
10232,True,The [definition the FBI currently uses for int...,CQME,,bav0rl,ekyelps,0.941533,0.882768,0.217543,True,...,False,0.884132,0.870870,0.606157,0.852373,0.843292,0.600762,The [definition FBI currently uses internation...,3,0
10233,True,[Yes.](https://en.m.wikipedia.org/wiki/Islamic...,Silent_As_The_Grave_,,bav0rl,ekehcqg,0.217683,0.779386,0.000000,True,...,False,0.217683,0.833056,0.369440,0.782545,0.800717,0.369440,[Yes.] Have look allies with. Hezbollah fucks ...,1,0
10234,True,Has ANY Shia ever committed an act of terroris...,bsmdphdjd,,bav0rl,ekfp4ls,1.293729,0.861529,0.000000,False,...,False,1.293729,0.847163,1.788263,0.834425,0.792615,1.788263,Has ANY Shia committed act terrorism U.S.?\n\n...,4,0


In [20]:
def df(topics_df, comments_df):
    #Build the corpus once: one text entry per topic
    corpus = []
    for index, topic in topics_df.iterrows():
        corpus.append(topic['text'])

    #Cache each term's idf in a dict so it is computed only once and lookups stay fast
    #(DataFrame.append was removed in pandas 2.0, so the rows are collected here instead)
    idf_by_term = {}
    for index, comment in comments_df.iterrows():
        term_arr = str.split(str(comment['content']))
        for element in term_arr:
            if element not in idf_by_term:
                idf_by_term[element] = idf(element, corpus)
        print(str((index/(len(comments_df.index)))*100) + "% complete")

    df_df = pd.DataFrame(list(idf_by_term.items()), columns = ['term', 'idf'])
    df_df.to_csv("files/tfidf_ss.csv", index=False)

df(full_topics_df, full_comments_df)
            

0.0% complete
0.009769441187964049% complete
0.019538882375928098% complete
...
53.92731535756155% complete
53.93708479874951% complete
53.94685423993748% complete
53.956623681125436% complete
53.966393122313406% complete
53.976162563501376% complete
53.98593200468933% complete
53.995701445877295% complete
54.00547088706526% complete
54.01524032825322% complete
54.02500976944119% complete
54.03477921062915% complete
54.04454865181712% complete
54.05431809300508% complete
54.06408753419304% complete
54.073856975381005% complete
54.083626416568976% complete
54.09339585775693% complete
54.1031652989449% complete
54.112934740132864% complete
54.12270418132083% complete
54.13247362250879% complete
54.14224306369676% complete
54.152012504884716% complete
54.161781946072686% complete
54.17155138726065% complete
54.1813

56.69206721375537% complete
56.701836654943335% complete
56.711606096131305% complete
56.72137553731926% complete
56.73114497850723% complete
56.7409144196952% complete
56.75068386088316% complete
56.76045330207112% complete
56.77022274325909% complete
56.779992184447046% complete
56.789761625635016% complete
56.79953106682297% complete
56.80930050801094% complete
56.819069949198905% complete
56.82883939038687% complete
56.83860883157483% complete
56.8483782727628% complete
56.85814771395076% complete
56.86791715513873% complete
56.87768659632669% complete
56.88745603751465% complete
56.897225478702616% complete
56.906994919890586% complete
56.91676436107854% complete
56.92653380226651% complete
56.936303243454475% complete
56.94607268464244% complete
56.9558421258304% complete
56.96561156701837% complete
56.97538100820633% complete
56.9851504493943% complete
56.99491989058226% complete
57.00468933177022% complete
57.014458772958186% complete
57.024228214146156% complete
57.03399765533

59.54474404064087% complete
59.55451348182884% complete
59.5642829230168% complete
59.57405236420477% complete
59.58382180539273% complete
59.593591246580694% complete
59.60336068776866% complete
59.61313012895663% complete
59.62289957014458% complete
59.63266901133255% complete
59.642438452520516% complete
59.65220789370848% complete
59.66197733489644% complete
59.67174677608441% complete
59.68151621727237% complete
59.69128565846034% complete
59.7010550996483% complete
59.71082454083626% complete
59.720593982024226% complete
59.7303634232122% complete
59.74013286440015% complete
59.74990230558812% complete
59.759671746776085% complete
59.76944118796405% complete
59.77921062915201% complete
59.78898007033998% complete
59.79874951152794% complete
59.80851895271591% complete
59.81828839390387% complete
59.82805783509183% complete
59.837827276279796% complete
59.847596717467766% complete
59.85736615865572% complete
59.86713559984369% complete
59.876905041031655% complete
59.8866744822196

62.416959749902304% complete
62.42672919109027% complete
62.43649863227824% complete
62.44626807346619% complete
62.45603751465416% complete
62.465806955842126% complete
62.47557639703009% complete
62.48534583821805% complete
62.49511527940602% complete
62.50488472059398% complete
62.51465416178195% complete
62.52442360296991% complete
62.534193044157874% complete
62.54396248534584% complete
62.55373192653381% complete
62.56350136772176% complete
62.57327080890973% complete
62.583040250097696% complete
62.59280969128566% complete
62.60257913247362% complete
62.61234857366159% complete
62.62211801484955% complete
62.63188745603752% complete
62.64165689722548% complete
62.651426338413444% complete
62.66119577960141% complete
62.67096522078938% complete
62.68073466197733% complete
62.6905041031653% complete
62.70027354435326% complete
62.71004298554123% complete
62.71981242672919% complete
62.72958186791715% complete
62.73935130910512% complete
62.74912075029309% complete
62.7588901914810

65.28917545916374% complete
65.2989449003517% complete
65.30871434153967% complete
65.31848378272763% complete
65.3282532239156% complete
65.33802266510355% complete
65.34779210629152% complete
65.35756154747948% complete
65.36733098866745% complete
65.37710042985542% complete
65.38686987104337% complete
65.39663931223134% complete
65.40640875341931% complete
65.41617819460727% complete
65.42594763579524% complete
65.4357170769832% complete
65.44548651817117% complete
65.45525595935912% complete
65.46502540054708% complete
65.47479484173505% complete
65.48456428292302% complete
65.49433372411097% complete
65.50410316529894% complete
65.51387260648691% complete
65.52364204767487% complete
65.53341148886284% complete
65.54318093005081% complete
65.55295037123877% complete
65.56271981242674% complete
65.57248925361469% complete
65.58225869480265% complete
65.59202813599062% complete
65.60179757717859% complete
65.61156701836654% complete
65.62133645955451% complete
65.63110590074248% comp

68.17116060961314% complete
68.1809300508011% complete
68.19069949198906% complete
68.20046893317702% complete
68.21023837436499% complete
68.22000781555295% complete
68.2297772567409% complete
68.23954669792887% complete
68.24931613911684% complete
68.2590855803048% complete
68.26885502149277% complete
68.27862446268074% complete
68.2883939038687% complete
68.29816334505666% complete
68.30793278624463% complete
68.31770222743259% complete
68.32747166862056% complete
68.33724110980852% complete
68.34701055099647% complete
68.35677999218444% complete
68.36654943337241% complete
68.37631887456037% complete
68.38608831574834% complete
68.39585775693631% complete
68.40562719812426% complete
68.41539663931223% complete
68.4251660805002% complete
68.43493552168816% complete
68.44470496287613% complete
68.45447440406409% complete
68.46424384525204% complete
68.47401328644001% complete
68.48378272762798% complete
68.49355216881594% complete
68.50332161000391% complete
68.51309105119188% comple

71.05314576006252% complete
71.06291520125049% complete
71.07268464243846% complete
71.08245408362642% complete
71.09222352481439% complete
71.10199296600234% complete
71.1117624071903% complete
71.12153184837827% complete
71.13130128956624% complete
71.1410707307542% complete
71.15084017194216% complete
71.16060961313013% complete
71.17037905431809% complete
71.18014849550606% complete
71.18991793669403% complete
71.19968737788199% complete
71.20945681906996% complete
71.21922626025791% complete
71.22899570144587% complete
71.23876514263384% complete
71.24853458382181% complete
71.25830402500976% complete
71.26807346619773% complete
71.2778429073857% complete
71.28761234857366% complete
71.29738178976163% complete
71.3071512309496% complete
71.31692067213756% complete
71.32669011332553% complete
71.33645955451348% complete
71.34622899570144% complete
71.35599843688941% complete
71.36576787807736% complete
71.37553731926533% complete
71.3853067604533% complete
71.39507620164126% comple

73.93513091051192% complete
73.94490035169989% complete
73.95466979288786% complete
73.96443923407581% complete
73.97420867526378% complete
73.98397811645174% complete
73.9937475576397% complete
74.00351699882766% complete
74.01328644001563% complete
74.02305588120359% complete
74.03282532239156% complete
74.04259476357953% complete
74.05236420476749% complete
74.06213364595546% complete
74.07190308714343% complete
74.08167252833138% complete
74.09144196951935% complete
74.10121141070731% complete
74.11098085189526% complete
74.12075029308323% complete
74.1305197342712% complete
74.14028917545916% complete
74.15005861664713% complete
74.15982805783509% complete
74.16959749902306% complete
74.17936694021103% complete
74.18913638139898% complete
74.19890582258695% complete
74.20867526377492% complete
74.21844470496288% complete
74.22821414615083% complete
74.2379835873388% complete
74.24775302852676% complete
74.25752246971473% complete
74.2672919109027% complete
74.27706135209066% compl

76.81711606096131% complete
76.82688550214928% complete
76.83665494333725% complete
76.8464243845252% complete
76.85619382571318% complete
76.86596326690113% complete
76.87573270808909% complete
76.88550214927706% complete
76.89527159046503% complete
76.90504103165298% complete
76.91481047284095% complete
76.92457991402891% complete
76.93434935521688% complete
76.94411879640485% complete
76.9538882375928% complete
76.96365767878078% complete
76.97342711996875% complete
76.9831965611567% complete
76.99296600234466% complete
77.00273544353263% complete
77.01250488472058% complete
77.02227432590855% complete
77.03204376709652% complete
77.04181320828448% complete
77.05158264947245% complete
77.06135209066042% complete
77.07112153184838% complete
77.08089097303635% complete
77.09066041422432% complete
77.10042985541227% complete
77.11019929660023% complete
77.1199687377882% complete
77.12973817897615% complete
77.13950762016412% complete
77.1492770613521% complete
77.15904650254005% comple

79.6991012114107% complete
79.70887065259868% complete
79.71864009378663% complete
79.7284095349746% complete
79.73817897616257% complete
79.74794841735053% complete
79.75771785853848% complete
79.76748729972645% complete
79.77725674091441% complete
79.78702618210238% complete
79.79679562329035% complete
79.8065650644783% complete
79.81633450566628% complete
79.82610394685425% complete
79.8358733880422% complete
79.84564282923017% complete
79.85541227041814% complete
79.8651817116061% complete
79.87495115279405% complete
79.88472059398202% complete
79.89449003516998% complete
79.90425947635795% complete
79.91402891754592% complete
79.92379835873388% complete
79.93356779992185% complete
79.94333724110982% complete
79.95310668229777% complete
79.96287612348574% complete
79.97264556467371% complete
79.98241500586167% complete
79.99218444704962% complete
80.0019538882376% complete
80.01172332942555% complete
80.02149277061352% complete
80.03126221180149% complete
80.04103165298945% complet

82.5810863618601% complete
82.59085580304807% complete
82.60062524423603% complete
82.610394685424% complete
82.62016412661197% complete
82.62993356779992% complete
82.63970300898788% complete
82.64947245017585% complete
82.6592418913638% complete
82.66901133255178% complete
82.67878077373975% complete
82.6885502149277% complete
82.69831965611567% complete
82.70808909730364% complete
82.7178585384916% complete
82.72762797967957% complete
82.73739742086754% complete
82.7471668620555% complete
82.75693630324345% complete
82.76670574443142% complete
82.77647518561938% complete
82.78624462680735% complete
82.79601406799532% complete
82.80578350918327% complete
82.81555295037124% complete
82.8253223915592% complete
82.83509183274717% complete
82.84486127393514% complete
82.8546307151231% complete
82.86440015631106% complete
82.87416959749902% complete
82.88393903868698% complete
82.89370847987495% complete
82.90347792106292% complete
82.91324736225087% complete
82.92301680343884% complete
8

85.4630715123095% complete
85.47284095349747% complete
85.48261039468542% complete
85.4923798358734% complete
85.50214927706136% complete
85.51191871824932% complete
85.52168815943728% complete
85.53145760062525% complete
85.5412270418132% complete
85.55099648300117% complete
85.56076592418914% complete
85.5705353653771% complete
85.58030480656507% complete
85.59007424775304% complete
85.599843688941% complete
85.60961313012896% complete
85.61938257131692% complete
85.62915201250489% complete
85.63892145369284% complete
85.64869089488081% complete
85.65846033606877% complete
85.66822977725674% complete
85.6779992184447% complete
85.68776865963267% complete
85.69753810082064% complete
85.70730754200859% complete
85.71707698319656% complete
85.72684642438453% complete
85.73661586557249% complete
85.74638530676046% complete
85.75615474794841% complete
85.76592418913637% complete
85.77569363032434% complete
85.78546307151231% complete
85.79523251270027% complete
85.80500195388824% complete

88.34505666275889% complete
88.35482610394686% complete
88.36459554513482% complete
88.37436498632279% complete
88.38413442751074% complete
88.39390386869871% complete
88.40367330988667% complete
88.41344275107463% complete
88.4232121922626% complete
88.43298163345057% complete
88.44275107463852% complete
88.45252051582649% complete
88.46228995701446% complete
88.47205939820242% complete
88.48182883939039% complete
88.49159828057836% complete
88.50136772176631% complete
88.51113716295428% complete
88.52090660414224% complete
88.53067604533021% complete
88.54044548651817% complete
88.55021492770614% complete
88.55998436889409% complete
88.56975381008206% complete
88.57952325127003% complete
88.58929269245799% complete
88.59906213364596% complete
88.60883157483393% complete
88.61860101602188% complete
88.62837045720985% complete
88.63813989839781% complete
88.64790933958577% complete
88.65767878077374% complete
88.6674482219617% complete
88.67721766314966% complete
88.68698710433763% com

91.22704181320829% complete
91.23681125439624% complete
91.24658069558421% complete
91.25635013677218% complete
91.26611957796014% complete
91.27588901914811% complete
91.28565846033607% complete
91.29542790152402% complete
91.30519734271199% complete
91.31496678389996% complete
91.32473622508792% complete
91.33450566627589% complete
91.34427510746386% complete
91.35404454865181% complete
91.36381398983978% complete
91.37358343102775% complete
91.38335287221571% complete
91.39312231340368% complete
91.40289175459164% complete
91.4126611957796% complete
91.42243063696756% complete
91.43220007815553% complete
91.44196951934349% complete
91.45173896053146% complete
91.46150840171943% complete
91.47127784290738% complete
91.48104728409535% complete
91.49081672528332% complete
91.50058616647128% complete
91.51035560765925% complete
91.5201250488472% complete
91.52989449003516% complete
91.53966393122313% complete
91.54943337241109% complete
91.55920281359906% complete
91.56897225478703% com

94.10902696365768% complete
94.11879640484564% complete
94.12856584603361% complete
94.13833528722158% complete
94.14810472840954% complete
94.1578741695975% complete
94.16764361078546% complete
94.17741305197342% complete
94.18718249316139% complete
94.19695193434936% complete
94.20672137553731% complete
94.21649081672528% complete
94.22626025791325% complete
94.23602969910121% complete
94.24579914028918% complete
94.25556858147715% complete
94.2653380226651% complete
94.27510746385308% complete
94.28487690504103% complete
94.294646346229% complete
94.30441578741696% complete
94.31418522860491% complete
94.32395466979288% complete
94.33372411098085% complete
94.34349355216881% complete
94.35326299335678% complete
94.36303243454475% complete
94.3728018757327% complete
94.38257131692068% complete
94.39234075810865% complete
94.4021101992966% complete
94.41187964048456% complete
94.42164908167253% complete
94.43141852286048% complete
94.44118796404845% complete
94.45095740523642% complet

96.99101211410708% complete
97.00078155529503% complete
97.010550996483% complete
97.02032043767097% complete
97.03008987885893% complete
97.0398593200469% complete
97.04962876123486% complete
97.05939820242281% complete
97.06916764361078% complete
97.07893708479874% complete
97.08870652598671% complete
97.09847596717468% complete
97.10824540836263% complete
97.1180148495506% complete
97.12778429073857% complete
97.13755373192653% complete
97.1473231731145% complete
97.15709261430247% complete
97.16686205549043% complete
97.1766314966784% complete
97.18640093786635% complete
97.19617037905431% complete
97.20593982024228% complete
97.21570926143025% complete
97.2254787026182% complete
97.23524814380617% complete
97.24501758499414% complete
97.2547870261821% complete
97.26455646737007% complete
97.27432590855804% complete
97.284095349746% complete
97.29386479093395% complete
97.30363423212192% complete
97.31340367330988% complete
97.32317311449785% complete
97.33294255568582% complete
97

99.87299726455646% complete
99.88276670574443% complete
99.8925361469324% complete
99.90230558812036% complete
99.91207502930833% complete
99.9218444704963% complete
99.93161391168425% complete
99.94138335287221% complete
99.95115279406018% complete
99.96092223524813% complete
99.9706916764361% complete
99.98046111762407% complete
99.99023055881203% complete


In [13]:
column_names = ['term', 'idf']
df_df = pd.DataFrame(columns = column_names)
element = 'word'
df_val = 0.034
df2 = pd.DataFrame([[element, df_val]], columns = column_names)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result here
df_df = pd.concat([df_df, df2])
df_df

Unnamed: 0,term,idf
0,word,0.034


In [14]:
df3 = pd.DataFrame([['new_word', 0.5]], columns = column_names)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_df = pd.concat([df_df, df3])
df_df

# full_comments_df['tfidf'] = tfidf_on_dataset(full_topics_df, full_comments_df)
# full_comments_df

Unnamed: 0,term,idf
0,word,0.034
0,new_word,0.5
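
The term/idf table being prototyped above could also be kept in a plain dict that is computed once per corpus, so the corpus is not rescanned for every word in every comment. This is only a minimal sketch using the two toy documents from the Wikipedia example; `build_idf_table` and `tfidf_from_table` are hypothetical helper names, not functions that exist elsewhere in this notebook.

```python
import math
from collections import Counter

doc1 = 'this is a a sample'
doc2 = 'this is another another example example example'
corpus = [doc1, doc2]

def build_idf_table(corpus):
    """Map every term in the corpus to log10(N / number of documents containing it)."""
    doc_freq = Counter()
    for doc in corpus:
        # set() so a term is counted once per document, however often it repeats
        doc_freq.update(set(doc.split()))
    n_docs = len(corpus)
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

def tfidf_from_table(term, document, idf_table):
    """tf * idf, reading idf from the precomputed table (0 for unseen terms)."""
    words = document.split()
    if not words:
        return 0.0
    return (words.count(term) / len(words)) * idf_table.get(term, 0)

idf_table = build_idf_table(corpus)
sample_score = tfidf_from_table('sample', doc1, idf_table)
```

The table is built in one pass over the corpus, and each later lookup is a dict access. Terms that never appear in the corpus fall back to an idf of 0, matching the `count == 0` branch of the original `idf` function.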


In [87]:
full_comments_df.to_csv("files/compiled_comments_w_tfidf.csv", index=False)

In [9]:
import pandas as pd
comments_df = pd.read_csv('files/compiled_comments_2_20_2021.csv')
comments_df

Unnamed: 0,action,content,author,details,submissionId,commentId,WordScore,WholeScore,tfidf,contains_url,text_without_url
0,,So what are the implications here? Does it onl...,Cody_Fox23,,4op948,d4eictg,0.000000,0.849655,0.001573,False,So what are the implications here? Does it onl...
1,,Sadly this isn't new. Police officers use many...,DrFrenchman,,4sef35,d58ts90,0.000000,0.900283,0.255802,False,Sadly this isn't new. Police officers use many...
2,,What's disturbing about this is that our gover...,bbakks,,4sef35,d58y081,-0.038865,0.869078,0.000000,False,What's disturbing about this is that our gover...
3,,What I find really concerning is the horrible ...,poliscijunki,,4sef35,d5919n8,0.000000,0.898426,0.000000,True,What I find really concerning is the horrible ...
4,,This subject might have legs but this article ...,interweb1,,64zsim,dg6l969,0.000000,0.850127,0.000000,False,This subject might have legs but this article ...
...,...,...,...,...,...,...,...,...,...,...,...
11443,,"Yes, while in East Baghdad my platoons mission...",CapitalCockroach,,bav0rl,ekggrgk,1.070477,0.840028,0.000000,True,"Yes, while in East Baghdad my platoons mission..."
11444,,The [definition the FBI currently uses for int...,CQME,,bav0rl,ekyelps,0.941533,0.882768,0.217543,True,The [definition the FBI currently uses for int...
11445,,[Yes.](https://en.m.wikipedia.org/wiki/Islamic...,Silent_As_The_Grave_,,bav0rl,ekehcqg,0.217683,0.779386,0.000000,True,[Yes.] Have a look at who they are allies with...
11446,,Has ANY Shia ever committed an act of terroris...,bsmdphdjd,,bav0rl,ekfp4ls,1.293729,0.861529,0.000000,False,Has ANY Shia ever committed an act of terroris...


Working on the comment upvote feature in this notebook, since the other notebook won't let me interrupt the kernel for some reason.

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import warnings
warnings.filterwarnings('ignore')

import os
from dotenv import load_dotenv
load_dotenv()

CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
APP_NAME = os.getenv('APP_NAME')
REDDIT_USERNAME = os.getenv('REDDIT_USERNAME')
REDDIT_PASSWORD = os.getenv('REDDIT_PASSWORD')

import praw
import datetime as dt

reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=APP_NAME,
    username=REDDIT_USERNAME,
    password=REDDIT_PASSWORD,
)

print(reddit.user.me())

%matplotlib inline

Version 7.1.0 of praw is outdated. Version 7.1.4 was released Sunday February 07, 2021.


mattcat26


In [None]:
comments_df = pd.read_csv('files/compiled_comments_2_20_2021.csv')

def grab_comment_upvotes(com_df):
    # Returns each comment's score, the summed comment score for each comment's submission,
    # and the submissionId -> summed score dictionary
    comment_score_arr = []
    sum_upvotes_arr = []
    dit = {}
    count = 1
    for (index, action, content, author, details, submissionId, commentId, WordScore, WholeScore, tfidf, contains_url, text_without_url) in com_df.itertuples(name=None):
        comment = 'fake 627819 comment'
        try:
            comment = reddit.comment(commentId)
            comment_score_arr.append(comment.score)
        except Exception:
            # If a comment cannot be read for some reason, its upvote score will be zero.
            # Catching Exception instead of using a bare except lets a kernel interrupt stop the loop.
            comment_score_arr.append(0)

        if comment != 'fake 627819 comment':
            try:
                dit[submissionId] = dit.get(submissionId, 0) + comment.score
            except Exception:
                dit['number_of_errors'] = dit.get('number_of_errors', 0) + 1
        if count % 50 == 0:
            print(str((index/(len(com_df.index)))*100) + "% complete (phase 1)")
            # NOTE: this return exits after the first 50 rows and hands back a single score;
            # remove it to process the full dataset
            return comment.score
        count += 1

    count = 1
    for (index, action, content, author, details, submissionId, commentId, WordScore, WholeScore, tfidf, contains_url, text_without_url) in com_df.itertuples(name=None):
        try:
            sum_upvotes_arr.append(dit[submissionId])
        except KeyError:
            # If no comment score was recorded for this submission (or every read failed), the sum of upvotes will be zero
            # This could be confusing for analysis, so it's something to keep in mind
            sum_upvotes_arr.append(0)
        if count % 50 == 0:
            print(str((index/(len(com_df.index)))*100) + "% complete (phase 2)")
        count += 1

    return comment_score_arr, sum_upvotes_arr, dit

comment = grab_comment_upvotes(comments_df)
# comments_df['comment_score'], comments_df['all_comments_scores'], new_dit = grab_comment_upvotes(comments_df)
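
The two-pass idea in `grab_comment_upvotes` (fetch every score, then attach each submission's total to its comments) can be tried without touching the Reddit API. The sketch below is an illustration under assumptions, not the notebook's method: `score_comments`, `fake_fetch` and the sample IDs are all hypothetical, and failed reads are recorded as `None` rather than 0 so that they stay distinguishable from comments that really have zero upvotes.

```python
from collections import defaultdict

def score_comments(rows, fetch_score):
    """rows: list of (submission_id, comment_id) pairs.
    fetch_score: callable returning an int score, or raising if the comment is unreadable."""
    scores = []
    totals = defaultdict(int)  # submission_id -> summed score of its readable comments
    errors = 0
    # Pass 1: fetch each comment's score and accumulate per-submission totals
    for submission_id, comment_id in rows:
        try:
            score = fetch_score(comment_id)
        except Exception:
            score = None  # unreadable comment, kept distinct from a real 0
            errors += 1
        scores.append(score)
        if score is not None:
            totals[submission_id] += score
    # Pass 2: give every row its submission's total (None if nothing was readable)
    submission_totals = [totals.get(submission_id) for submission_id, _ in rows]
    return scores, submission_totals, errors

# Hypothetical stand-in for the API: 'c3' is missing, so looking it up raises KeyError
fake_scores = {'c1': 3, 'c2': 5, 'c4': 0}

def fake_fetch(comment_id):
    return fake_scores[comment_id]

rows = [('s1', 'c1'), ('s1', 'c2'), ('s2', 'c3'), ('s3', 'c4')]
scores, submission_totals, errors = score_comments(rows, fake_fetch)
```

Keeping the error count in its own variable, rather than under a `'number_of_errors'` key in the same dict, avoids mixing a counter in with submission IDs. Whether `None` or 0 is the better placeholder depends on how the column is used in the later regression.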

In [None]:
comment