## PRAW

Python Reddit API Wrapper

- Limitations: PRAW can't retrieve every submission and comment in a subreddit; it only exposes listings such as new, hot, top, controversial, and gilded.
- Even with `limit` set to 5000, a listing returns fewer than 1,000 results.
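
Reddit caps each listing at roughly 1,000 items regardless of the requested `limit`, so one workaround is to pull several listings and merge them by post id. A minimal sketch of that merge, using plain dicts in place of PRAW objects; `merge_by_id` is a hypothetical helper, not part of PRAW:

```python
def merge_by_id(*listings):
    """Merge several iterables of post dicts, keeping the first copy of each id."""
    merged = {}
    for listing in listings:
        for post in listing:
            # setdefault keeps the earliest occurrence and ignores later duplicates
            merged.setdefault(post["id"], post)
    return list(merged.values())


# Stand-in rows; real rows would come from the hot/new/top listings.
hot = [{"id": "a1", "title": "first"}, {"id": "b2", "title": "second"}]
new = [{"id": "b2", "title": "second"}, {"id": "c3", "title": "third"}]

posts = merge_by_id(hot, new)
print(len(posts))  # 3 unique posts out of 4 rows
```

Merging only widens coverage; it does not lift the per-listing cap, so older posts outside every listing remain unreachable this way.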

https://praw.readthedocs.io/en/v7.7.0/code_overview/models/subreddit.html#praw.models.Subreddit

### Submissions Attributes
- `id`: post id
- `title`: post title
- `text`: post selftext (body)
- `subreddit`: post subreddit
- `author`: post author username
- `upvotes`: post upvotes
- `downvotes`: post downvotes
- `upvote_ratio`: post upvote ratio
- `score`: post score (net upvotes)
- `created_utc`: post creation time (UTC epoch seconds)
- `url`: post permalink
- `link`: post URL (photos/videos/media)
- `tags`: post tags
- `num_comments`: number of comments on the post

### Comments Attributes
- `id`: comment id
- `text`: comment body
- `subreddit`: comment subreddit
- `author`: comment author username
- `upvotes`: comment upvotes
- `downvotes`: comment downvotes
- `score`: comment score (net upvotes)
- `created_utc`: comment creation time (UTC epoch seconds)
- `url`: permalink of the submission the comment belongs to
- `link`: URL attached to that submission (`link_url`)
- `parent_id`: fullname of the parent (a submission or another comment)
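
The `parent_id` value is a Reddit fullname: a type prefix, an underscore, then the base-36 id. The prefix `t1_` marks a comment and `t3_` a submission, so it separates top-level comments from replies. A small sketch of reading it; `split_fullname` and `is_top_level` are hypothetical helper names, not PRAW functions:

```python
# Only the two prefixes that occur in parent_id are mapped here.
KIND_BY_PREFIX = {"t1": "comment", "t3": "submission"}


def split_fullname(fullname):
    """Split a fullname such as 't3_121ofyl' into (kind, base36_id)."""
    prefix, _, base_id = fullname.partition("_")
    return KIND_BY_PREFIX.get(prefix, "unknown"), base_id


def is_top_level(parent_id):
    """A comment is top-level when its parent is the submission itself."""
    return parent_id.startswith("t3_")


print(split_fullname("t3_121ofyl"))  # ('submission', '121ofyl')
print(is_top_level("t1_jdmm239"))    # False
```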

In [2]:
import praw
from dotenv import dotenv_values
import pandas as pd

# retrieve dotenv config
config = dotenv_values(".env")

reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    user_agent=config['USER_AGENT'],
)
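
`dotenv_values` returns a dict-like mapping, so a key missing from `.env` only surfaces as a `KeyError` when it is read. A hedged sketch of checking the three keys up front; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration:

```python
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in a config mapping."""
    return [key for key in required if not config.get(key)]


# Stand-in mapping; in the notebook this would be dotenv_values(".env").
example = {"CLIENT_ID": "abc", "CLIENT_SECRET": "", "USER_AGENT": "my-app/0.1"}
print(missing_keys(example))  # ['CLIENT_SECRET']
```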

In [3]:
def get_posts(generator):
    """Convert an iterable of PRAW submissions into a list of dicts."""
    res_list = []
    for post in generator:
        sub = {
            'id': post.id,
            'title': post.title,
            'text': post.selftext,
            'subreddit': post.subreddit.display_name,
            # author is None when the account has been deleted
            'author': post.author.name if post.author is not None else None,
            'upvotes': post.ups,
            'downvotes': post.downs,
            'upvote_ratio': post.upvote_ratio,
            'score': post.score,
            'created_utc': post.created_utc,
            'url': 'reddit.com' + post.permalink,
            'link': post.url,
            'tags': post.link_flair_richtext,
            'num_comments': post.num_comments
        }
        res_list.append(sub)

    return res_list


def get_comments(generator):
    """Convert an iterable of PRAW comments into a list of dicts."""
    res_list = []
    for comment in generator:
        cmt = {
            'id': comment.id,
            'text': comment.body,
            'subreddit': comment.subreddit.display_name,
            # author is None when the account has been deleted
            'author': comment.author.name if comment.author is not None else None,
            'upvotes': comment.ups,
            'downvotes': comment.downs,
            'score': comment.score,
            'created_utc': comment.created_utc,
            'url': comment.link_permalink,
            'link': comment.link_url,
            'parent_id': comment.parent_id
        }
        res_list.append(cmt)

    return res_list
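
The `tags` field stores `link_flair_richtext` unchanged, which is a list of dicts such as `[{'e': 'text', 't': 'SpaceX'}]`. A possible way to flatten it into plain labels, assuming text elements carry `'e': 'text'` with the label under `'t'`, as in this notebook's data; emoji elements are skipped, and `flair_labels` is a hypothetical helper:

```python
def flair_labels(richtext):
    """Return the label of every 'text' element in a flair rich-text list."""
    return [part["t"] for part in richtext if part.get("e") == "text"]


print(flair_labels([{"e": "text", "t": "SpaceX"}]))  # ['SpaceX']
print(flair_labels([]))                              # []
```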

In [None]:
hot1 = reddit.subreddit('elonmusk').hot(limit=1000)
sub = next(hot1)

# view the attributes of one submission; flag any others we need
vars(sub)

{'comment_limit': 2048,
 'comment_sort': 'confidence',
 '_reddit': <praw.reddit.Reddit at 0x106aee510>,
 'approved_at_utc': None,
 'subreddit': Subreddit(display_name='elonmusk'),
 'selftext': '',
 'author_fullname': 't2_afjb8oz6',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Liftoff.54321.',
 'link_flair_richtext': [{'e': 'text', 't': 'SpaceX'}],
 'subreddit_name_prefixed': 'r/elonmusk',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': '',
 'downs': 0,
 'thumbnail_height': 78,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_12150g4',
 'quarantine': False,
 'link_flair_text_color': 'light',
 'upvote_ratio': 0.92,
 'author_flair_background_color': None,
 'ups': 331,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': {'reddit_video': {'bitrate_kbps': 2400,
   'fallback_url': 'https://v.redd.it/e72u906

In [None]:
# generator objects
hot = reddit.subreddit('elonmusk').hot(limit=1000)
new = reddit.subreddit('elonmusk').new(limit=1000)
top = reddit.subreddit('elonmusk').top(limit=1000)
all_subreds = reddit.subreddit('all').search('elon musk', limit=5000)

comments = reddit.subreddit('elonmusk').comments(limit=1000)
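
These listing objects are lazy, single-pass iterators: items are fetched as they are iterated, and anything already consumed (for example with `next`) is not replayed. The behaviour can be illustrated with a plain Python generator standing in for a listing; `fake_listing` is hypothetical and makes no network calls:

```python
def fake_listing(n):
    """Yield n stand-in post ids lazily, like a listing iterator."""
    for i in range(n):
        yield f"post{i}"


listing = fake_listing(3)
first = next(listing)      # consumes 'post0'
remaining = list(listing)  # only the items not yet consumed
again = list(listing)      # the iterator is now exhausted

print(first, remaining, again)  # post0 ['post1', 'post2'] []
```

This is why the listings are created afresh here rather than reusing the one that was already advanced with `next`.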

In [None]:
# parse generator objects
hotlist = get_posts(hot)
newlist = get_posts(new)
toplist = get_posts(top)
all_sub_list = get_posts(all_subreds)
comment_list = get_comments(comments)

In [None]:
comment_df = pd.DataFrame(comment_list)
comment_df

Unnamed: 0,id,text,subreddit,author,upvotes,downvotes,score,created_utc,url,link,parent_id
0,jdmrbax,Sounds like he’s controlling. I go to my offic...,elonmusk,mari815,1,0,1,1.679759e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.kumaonjagran.com/elon-musk-tells-t...,t3_121ofyl
1,jdmpl8k,Get their butts into work and out of those paj...,elonmusk,Drone_Boss,1,0,1,1.679758e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.kumaonjagran.com/elon-musk-tells-t...,t3_121ofyl
2,jdmoyi0,I did a trial. I went to the office for 5 days...,elonmusk,thethreat88,1,0,1,1.679758e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.kumaonjagran.com/elon-musk-tells-t...,t3_121ofyl
3,jdmn89u,Boss up,elonmusk,Zealousideal-Ice9173,1,0,1,1.679757e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.kumaonjagran.com/elon-musk-tells-t...,t1_jdmm239
4,jdmm239,"""Boss says going to work not optional"" we need...",elonmusk,Aquada_,1,0,1,1.679757e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.kumaonjagran.com/elon-musk-tells-t...,t3_121ofyl
...,...,...,...,...,...,...,...,...,...,...,...
654,jcqzbcd,"If you want to improve the medical system, the...",elonmusk,pyguy6,5,0,5,1.679177e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcqye7y
655,jcqye7y,Prescribed off lable. And kids and teenagers a...,elonmusk,Least777,0,0,0,1.679176e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcqvoyn
656,jcqxbke,She is spot on.mate,elonmusk,ZaroonKhan5,2,0,2,1.679176e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://v.redd.it/jkw5p81147oa1,t1_jcnlhw8
657,jcqvoyn,"Thing is puberty is also irreversible, which i...",elonmusk,pyguy6,5,0,5,1.679175e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcqufyl


In [None]:
# comment_df.to_csv('Raw datasets/comments.csv')

In [None]:
combinedlist = []
combinedlist.extend(hotlist)
combinedlist.extend(newlist)
combinedlist.extend(toplist)
combinedlist.extend(all_sub_list)
len(combinedlist)

2672

In [None]:
df = pd.DataFrame(combinedlist)
df

Unnamed: 0,id,title,text,subreddit,author,upvotes,downvotes,upvote_ratio,score,created_utc,url,link,tags,num_comments
0,12150g4,Liftoff.54321.,,elonmusk,ZaroonKhan5,324,0,0.92,324,1.679705e+09,reddit.com/r/elonmusk/comments/12150g4/liftoff...,https://v.redd.it/e72u906e8spa1,"[{'e': 'text', 't': 'SpaceX'}]",18
1,121ofyl,"Elon Musk tells Twitter employees ""Office is n...",,elonmusk,Express_Turn_5489,6,0,0.67,6,1.679756e+09,reddit.com/r/elonmusk/comments/121ofyl/elon_mu...,https://www.kumaonjagran.com/elon-musk-tells-t...,"[{'e': 'text', 't': 'Tweets'}]",10
2,120n1qh,Musk denies multibillion investment in SpaceX ...,,elonmusk,Alex_ZH1,349,0,0.92,349,1.679670e+09,reddit.com/r/elonmusk/comments/120n1qh/musk_de...,https://www.quicktechnics.com/en/post/musk-den...,"[{'e': 'text', 't': 'SpaceX'}]",69
3,1215ef2,Elon Musk believes Neuralink can help address ...,,elonmusk,Pawnti,6,0,0.61,6,1.679706e+09,reddit.com/r/elonmusk/comments/1215ef2/elon_mu...,https://www.httnews.com/technology/eyi02xa6agb...,"[{'e': 'text', 't': 'Tweets'}]",12
4,11zpwfi,"WHO Warns Of ""Fake News"" After Elon Musk Pande...",,elonmusk,erinswider,281,0,0.86,281,1.679590e+09,reddit.com/r/elonmusk/comments/11zpwfi/who_war...,https://globenewsbulletin.com/world/who-warns-...,"[{'e': 'text', 't': 'Tweets'}]",165
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,11er3dn,Let’s see how his free speech pans out: China’...,Smart money’s on him shutting up fairly quickly,RealTwitterAccounts,defectivetrashdetect,1057,0,0.97,1057,1.677636e+09,reddit.com/r/RealTwitterAccounts/comments/11er...,https://www.cnbc.com/2023/02/28/chinas-ccp-war...,"[{'e': 'text', 't': 'Political™'}]",95
2668,120cvcv,Texans blast Elon Musk's Boring plan to dump w...,,RealTesla,Zorkmid123,274,0,0.95,274,1.679643e+09,reddit.com/r/RealTesla/comments/120cvcv/texans...,https://www.foxnews.com/us/texans-blast-elon-m...,[],51
2669,10efs58,"For Glass Onion, Rian Johnson convinced elon m...",,shittymoviedetails,of_kilter,16439,0,0.91,16439,1.673971e+09,reddit.com/r/shittymoviedetails/comments/10efs...,https://i.redd.it/mrjmj1f34oca1.jpg,[],294
2670,zsghfu,Elon Musk getting owned by a former Twitter en...,,facepalm,eichenes,10693,0,0.95,10693,1.671696e+09,reddit.com/r/facepalm/comments/zsghfu/elon_mus...,https://v.redd.it/azr15yt2qe7a1,"[{'a': ':Misc:', 'e': 'emoji', 'u': 'https://e...",1031


In [None]:
# remove duplicates, keeping the first occurrence of each post id
df = df.drop_duplicates(subset='id')
df

Unnamed: 0,id,title,text,subreddit,author,upvotes,downvotes,upvote_ratio,score,created_utc,url,link,tags,num_comments
0,12150g4,Liftoff.54321.,,elonmusk,ZaroonKhan5,324,0,0.92,324,1.679705e+09,reddit.com/r/elonmusk/comments/12150g4/liftoff...,https://v.redd.it/e72u906e8spa1,"[{'e': 'text', 't': 'SpaceX'}]",18
1,121ofyl,"Elon Musk tells Twitter employees ""Office is n...",,elonmusk,Express_Turn_5489,6,0,0.67,6,1.679756e+09,reddit.com/r/elonmusk/comments/121ofyl/elon_mu...,https://www.kumaonjagran.com/elon-musk-tells-t...,"[{'e': 'text', 't': 'Tweets'}]",10
2,120n1qh,Musk denies multibillion investment in SpaceX ...,,elonmusk,Alex_ZH1,349,0,0.92,349,1.679670e+09,reddit.com/r/elonmusk/comments/120n1qh/musk_de...,https://www.quicktechnics.com/en/post/musk-den...,"[{'e': 'text', 't': 'SpaceX'}]",69
3,1215ef2,Elon Musk believes Neuralink can help address ...,,elonmusk,Pawnti,6,0,0.61,6,1.679706e+09,reddit.com/r/elonmusk/comments/1215ef2/elon_mu...,https://www.httnews.com/technology/eyi02xa6agb...,"[{'e': 'text', 't': 'Tweets'}]",12
4,11zpwfi,"WHO Warns Of ""Fake News"" After Elon Musk Pande...",,elonmusk,erinswider,281,0,0.86,281,1.679590e+09,reddit.com/r/elonmusk/comments/11zpwfi/who_war...,https://globenewsbulletin.com/world/who-warns-...,"[{'e': 'text', 't': 'Tweets'}]",165
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,11er3dn,Let’s see how his free speech pans out: China’...,Smart money’s on him shutting up fairly quickly,RealTwitterAccounts,defectivetrashdetect,1057,0,0.97,1057,1.677636e+09,reddit.com/r/RealTwitterAccounts/comments/11er...,https://www.cnbc.com/2023/02/28/chinas-ccp-war...,"[{'e': 'text', 't': 'Political™'}]",95
2668,120cvcv,Texans blast Elon Musk's Boring plan to dump w...,,RealTesla,Zorkmid123,274,0,0.95,274,1.679643e+09,reddit.com/r/RealTesla/comments/120cvcv/texans...,https://www.foxnews.com/us/texans-blast-elon-m...,[],51
2669,10efs58,"For Glass Onion, Rian Johnson convinced elon m...",,shittymoviedetails,of_kilter,16439,0,0.91,16439,1.673971e+09,reddit.com/r/shittymoviedetails/comments/10efs...,https://i.redd.it/mrjmj1f34oca1.jpg,[],294
2670,zsghfu,Elon Musk getting owned by a former Twitter en...,,facepalm,eichenes,10693,0,0.95,10693,1.671696e+09,reddit.com/r/facepalm/comments/zsghfu/elon_mus...,https://v.redd.it/azr15yt2qe7a1,"[{'a': ':Misc:', 'e': 'emoji', 'u': 'https://e...",1031


In [None]:
# df.to_csv('Raw datasets/submissions.csv')

In [1]:
import pandas as pd

In [8]:
submissions = pd.read_csv('../Raw datasets/submissions.csv')
comments = pd.read_csv('../Raw datasets/comments.csv')

In [15]:
ls ../Cleaned\ datasets

cmts_cleaned.csv           subs_cleaned.csv
cmts_cleaned_labelled.csv  tweets_cleaned.csv


In [17]:
submissions_cleaned = pd.read_csv('../Cleaned datasets/subs_cleaned.csv')
comments_cleaned = pd.read_csv('../Cleaned datasets/cmts_cleaned.csv')

In [19]:
comments_cleaned

Unnamed: 0.1,Unnamed: 0,id,text,subreddit,author,upvotes,downvotes,score,created_utc,url,link,parent_id,date,cleaned_text
0,0,jdm98sm,"Mmmm yes, forward thinking, like underground t...",elonmusk,ultimate_placeholder,1,0,1,1.679750e+09,https://www.reddit.com/r/elonmusk/comments/120...,https://www.quicktechnics.com/en/post/musk-den...,t1_jdl2saz,2023-03-25 13:19:51,mmmm yes forward thinking like underground tub...
1,1,jdm72zy,Probably not. It's like if attaching horse to ...,elonmusk,kroOoze,1,0,1,1.679749e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://www.httnews.com/technology/eyi02xa6agb...,t3_1215ef2,2023-03-25 13:00:17,probably like attaching horse automobile would...
2,2,jdm2d94,Cool in any decade ever! A rocket going to spa...,elonmusk,swag_money69,1,0,1,1.679746e+09,https://www.reddit.com/r/elonmusk/comments/121...,https://v.redd.it/e72u906e8spa1,t3_12150g4,2023-03-25 12:13:56,cool decade ever rocket going space come would...
3,3,jdlxewi,"He is on of the most famous people out there, ...",elonmusk,jordanhanson,1,0,1,1.679743e+09,https://www.reddit.com/r/elonmusk/comments/zg1...,https://www.teslaoracle.com/2022/12/07/elon-mu...,t1_izhfq3i,2023-03-25 11:17:36,famous people like steve jobs never used among...
4,4,jdltrsd,The **average** rent for an apartment in Los A...,elonmusk,OGquaker,1,0,1,1.679740e+09,https://www.reddit.com/r/elonmusk/comments/11r...,https://cleanenergyrevolution.co/2023/03/15/te...,t1_jcbl54o,2023-03-25 10:29:27,average rent apartment los angeles 2786 month ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
630,657,jcqspkf,It will be the Elon bot. Tesla is already run...,elonmusk,Whole-Mail2239,1,0,1,1.679174e+09,https://www.reddit.com/r/elonmusk/comments/11u...,https://elonmu.sh/,t1_jcp97pf,2023-03-18 21:05:48,elon bot tesla already run
631,658,jcqsehv,"I mean he's right, Elon doesn't seem like the ...",elonmusk,pyguy6,3,0,3,1.679173e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcky3md,2023-03-18 21:03:37,mean hes right elon doesnt seem like best pers...
632,659,jcqsa8k,And his trans kid disowning him probably did a...,elonmusk,pyguy6,8,0,8,1.679173e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcjk107,2023-03-18 21:02:36,trans kid disowning probably number
633,660,jcqs7mk,Therapy and counseling is a big part of gender...,elonmusk,pyguy6,2,0,2,1.679173e+09,https://www.reddit.com/r/elonmusk/comments/11t...,https://www.httnews.com/business/elon-musk-spa...,t1_jcj777s,2023-03-18 21:02:05,therapy counseling big part gender affirming c...


In [9]:
submissions = submissions.drop(columns=['Unnamed: 0'])
submissions.dtypes

id               object
title            object
text             object
subreddit        object
author           object
upvotes           int64
downvotes         int64
upvote_ratio    float64
score             int64
created_utc     float64
url              object
link             object
tags             object
num_comments      int64
dtype: object

In [10]:
comments = comments.drop(columns=['Unnamed: 0'])
comments.dtypes

id              object
text            object
subreddit       object
author          object
upvotes          int64
downvotes        int64
score            int64
created_utc    float64
url             object
link            object
parent_id       object
dtype: object

In [11]:
combined = pd.read_csv('../combined.csv')

In [12]:
combined

Unnamed: 0,platform_id,date,text,cleaned_text,title,post_text,subreddit,author,upvote_ratio,score,retweet_count,like_count,url,link,num_comments,tags,source,subjectivity,sentiment
0,12150g4,2023-03-25 00:46:43,Liftoff.54321.,liftoff54321,Liftoff.54321.,,elonmusk,ZaroonKhan5,0.91,290.0,,,reddit.com/r/elonmusk/comments/12150g4/liftoff...,https://v.redd.it/e72u906e8spa1,18.0,['SpaceX'],reddit_sub,,
1,120n1qh,2023-03-24 15:04:18,Musk denies multibillion investment in SpaceX ...,musk denies multibillion investment spacex sau...,Musk denies multibillion investment in SpaceX ...,,elonmusk,Alex_ZH1,0.92,339.0,,,reddit.com/r/elonmusk/comments/120n1qh/musk_de...,https://www.quicktechnics.com/en/post/musk-den...,69.0,['SpaceX'],reddit_sub,,
2,121ofyl,2023-03-25 14:56:49,"Elon Musk tells Twitter employees ""Office is n...",elon musk tells twitter employees office optio...,"Elon Musk tells Twitter employees ""Office is n...",,elonmusk,Express_Turn_5489,1.00,2.0,,,reddit.com/r/elonmusk/comments/121ofyl/elon_mu...,https://www.kumaonjagran.com/elon-musk-tells-t...,2.0,['Tweets'],reddit_sub,,
3,1215ef2,2023-03-25 01:01:24,Elon Musk believes Neuralink can help address ...,elon musk believes neuralink help address self...,Elon Musk believes Neuralink can help address ...,,elonmusk,Pawnti,0.61,6.0,,,reddit.com/r/elonmusk/comments/1215ef2/elon_mu...,https://www.httnews.com/technology/eyi02xa6agb...,11.0,['Tweets'],reddit_sub,,
4,11zpwfi,2023-03-23 16:41:50,"WHO Warns Of ""Fake News"" After Elon Musk Pande...",warns fake news elon musk pandemic treaty tweet,"WHO Warns Of ""Fake News"" After Elon Musk Pande...",,elonmusk,erinswider,0.86,281.0,,,reddit.com/r/elonmusk/comments/11zpwfi/who_war...,https://globenewsbulletin.com/world/who-warns-...,165.0,['Tweets'],reddit_sub,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12671,1639256815354404865,2023-03-24 21:24:15,We are damaging the environment. But not by us...,damaging environment using much energy using e...,,,,alexandre_lores,,,3.0,14.0,https://twitter.com/alexandre_lores/status/163...,,,,twitter,,
12672,1639256812741357571,2023-03-24 21:24:14,@elonmusk Yas😊😊 very good,elonmusk yas smilingfacewithsmilingeyes smilin...,,,,ashutos07601960,,,0.0,0.0,https://twitter.com/ashutos07601960/status/163...,,,,twitter,,
12673,1639256808979337216,2023-03-24 21:24:13,"@runews @elonmusk ""became aware"" you, nasty pu...",runews elonmusk became aware nasty pulitzer,,,,pates_karbo,,,0.0,0.0,https://twitter.com/pates_karbo/status/1639256...,,,,twitter,,
12674,1639256805673975808,2023-03-24 21:24:13,@luffysmayie Elon Musk sucks so bad,luffysmayie elon musk sucks bad,,,,IIuffy,,,0.0,0.0,https://twitter.com/IIuffy/status/163925680567...,,,,twitter,,


In [20]:
combined.dtypes

platform_id       object
date              object
text              object
cleaned_text      object
title             object
post_text         object
subreddit         object
author            object
upvote_ratio     float64
score            float64
retweet_count    float64
like_count       float64
url               object
link              object
num_comments     float64
tags              object
source            object
subjectivity     float64
sentiment        float64
dtype: object

### Combined Attributes
- `platform_id`: tweet_id, submission_id, or comment_id
- `date`: creation date (standardised to UTC)
- `text`: tweet body, comment body, or submission title + body
- `title`: (reddit sub only) reddit submission title
- `post_text`: (reddit sub only) reddit submission body
- `subreddit`: (reddit only) subreddit of the submission/comment
- `author`: author username
- `upvote_ratio`: (reddit sub only) reddit upvote ratio
- `net_upvotes`: (reddit only) reddit net upvotes (previously `score`)
- `retweet_count`: (twitter only) number of retweets
- `like_count`: (twitter only) number of likes
- `url`: URL of the original post
- `link`: (reddit only) link attached to the post, usually an image link
- `num_comments`: (reddit only) number of comments on the post
- `tags`: (reddit only) tags attached to the post
- `source`: source of the post: `reddit_sub`, `reddit_cmt`, or `twitter`
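
The `date` values are consistent with Reddit's `created_utc` epoch seconds read as UTC. The notebook's own conversion is not shown, so the following is only a sketch of one way to do it; `epoch_to_utc_string` is a hypothetical helper:

```python
from datetime import datetime, timezone


def epoch_to_utc_string(created_utc):
    """Format epoch seconds as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(epoch_to_utc_string(1679702400.0))  # 2023-03-25 00:00:00
```

Passing `tz=timezone.utc` matters: without it, `fromtimestamp` uses the machine's local timezone and the dates would shift between environments.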

In [23]:
import pandas as pd
import nltk
import re

# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
tokenizer = nltk.tokenize.word_tokenize

# tokenize each text, keep only purely alphabetic tokens, and count them per row
combined['tokens'] = combined['text'].apply(tokenizer)
combined['string_tokens'] = combined['tokens'].apply(lambda x: [t for t in x if re.match('^[a-zA-Z]+$', t)])
combined['num_tokens'] = combined['string_tokens'].apply(len)

# total number of alphabetic tokens across all rows
total_tokens = combined['num_tokens'].sum()
print(total_tokens)

230988
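
The total counts only purely alphabetic tokens, so numbers, punctuation, contractions fragments, and mixed tokens are excluded. A small illustration of the same filter on a hand-made token list (the tokens are made up for the example rather than produced by `word_tokenize`):

```python
import re


def alphabetic_tokens(tokens):
    """Keep tokens made up solely of ASCII letters, as in the filter above."""
    return [t for t in tokens if re.match(r'^[a-zA-Z]+$', t)]


sample = ["Elon", "Musk", "'s", "2786", "month", "!", "Liftoff.54321"]
kept = alphabetic_tokens(sample)
print(kept, len(kept))  # ['Elon', 'Musk', 'month'] 3
```

Because the pattern is ASCII-only, accented or non-Latin words are dropped as well, which lowers the count for multilingual text.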


In [28]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# stem each alphabetic token, then collect the distinct stems across all rows
combined['stemmed_tokens'] = combined['string_tokens'].apply(lambda x: [stemmer.stem(t) for t in x])
unique_stems = combined['stemmed_tokens'].explode().dropna().unique()

In [29]:
print(unique_stems)

['musk' 'deni' 'multibillion' ... 'decapit' 'potter' 'luffysmayi']


In [30]:
len(unique_stems)

13273
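
The unique-stem count can be cross-checked without pandas by flattening the per-row stem lists and counting them. A sketch on made-up stem lists (the real input would be the `stemmed_tokens` column converted to a list):

```python
from collections import Counter
from itertools import chain

# Stand-in rows of stems; an empty list models a text with no alphabetic tokens.
stemmed_rows = [["musk", "deni", "invest"], [], ["musk", "tweet"], ["tweet", "musk"]]

# chain.from_iterable flattens the rows in a single linear pass
stem_counts = Counter(chain.from_iterable(stemmed_rows))

print(len(stem_counts))            # 4 distinct stems
print(stem_counts.most_common(1))  # [('musk', 3)]
```

The `Counter` also gives per-stem frequencies, which the bare unique count does not.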