## PRAW

Python Reddit API Wrapper

- Limitations: PRAW cannot retrieve every submission and comment in a subreddit. It only exposes the listings Reddit provides: new, hot, top, controversial, and gilded.
- Even with `limit=5000`, a listing returns fewer than 1,000 results.
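
Reddit's listing endpoints appear to stop at roughly 1,000 items whatever `limit` is passed, which would explain the shortfall noted above. One common workaround is to pull several listings and merge them by `id`. The sketch below illustrates that idea only; `merge_listings` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for PRAW submissions so the example runs without credentials.

```python
from itertools import chain
from types import SimpleNamespace


def merge_listings(*listings):
    """Merge several listing iterables, keeping the first item seen per id."""
    seen = {}
    for item in chain(*listings):
        # setdefault keeps the earlier item when an id repeats
        seen.setdefault(item.id, item)
    return list(seen.values())


# Stand-ins for PRAW submissions (assumption: only `.id` is needed here).
hot = [SimpleNamespace(id="a1"), SimpleNamespace(id="b2")]
new = [SimpleNamespace(id="b2"), SimpleNamespace(id="c3")]

merged = merge_listings(hot, new)
print([s.id for s in merged])  # ['a1', 'b2', 'c3']
```

With a real client, the same helper could be fed `subreddit.hot(limit=None)`, `subreddit.new(limit=None)`, and so on; the merged result is still bounded by what each listing returns.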

https://praw.readthedocs.io/en/v7.7.0/code_overview/models/subreddit.html#praw.models.Subreddit

### Submissions Attributes

#### Scraped
- `id`: post ID
- `title`: post title
- `text`: post body (`selftext`)
- `subreddit`: name of the subreddit the post belongs to
- `author`: author username (`None` when the API returns no author)
- `upvotes`: post upvotes (`ups`)
- `downvotes`: post downvotes (`downs`); unreliable, it is 0 for every row in the sample output
- `upvote_ratio`: share of votes that are upvotes
- `score`: net upvotes
- `created_utc`: creation time as a Unix timestamp (UTC)
- `url`: permalink to the post on Reddit
- `link`: URL the post points to (photos, videos, other media)
- `tags`: post flair (`link_flair_richtext`)
- `num_comments`: number of comments on the post
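
The `tags` field is stored as the raw `link_flair_richtext` list, which is awkward to filter on. As a rough sketch, assuming each entry is a dict whose `'e'` key marks the part type and whose `'t'` key holds the text (the shape seen in the sample output), the text parts could be pulled out like this. `flair_text` is a hypothetical helper, and the emoji URL is a placeholder, not a real value.

```python
def flair_text(richtext):
    """Return the plain-text parts of a link_flair_richtext list."""
    return [
        part["t"]
        for part in richtext
        if part.get("e") == "text" and "t" in part
    ]


print(flair_text([{"e": "text", "t": "SpaceX"}]))  # ['SpaceX']
# Emoji-only flair carries no text part (placeholder URL, assumption).
print(flair_text([{"a": ":Misc:", "e": "emoji", "u": "https://example.invalid/misc.png"}]))  # []
print(flair_text([]))  # []
```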

### Comments Attributes

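#### Scraped
- `id`: comment ID
- `text`: comment body
- `subreddit`: name of the subreddit the comment belongs to
- `author`: author username (`None` when the API returns no author)
- `upvotes`: comment upvotes (`ups`)
- `downvotes`: comment downvotes (`downs`)
- `score`: net upvotes
- `created_utc`: creation time as a Unix timestamp (UTC)
- `url`: permalink of the submission the comment was posted on (`link_permalink`)
- `link`: URL that submission points to (`link_url`)
- `parent_id`: ID of the parent, prefixed `t3_` for a submission or `t1_` for another comment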
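
The prefix on `parent_id` tells whether a comment is top-level (its parent is the submission) or a reply to another comment. A minimal sketch of splitting it apart, where `parent_kind` is a hypothetical helper and only the two prefixes seen in the scraped data are mapped:

```python
def parent_kind(parent_id):
    """Split a prefixed parent ID into (kind, bare_id)."""
    prefix, _, bare_id = parent_id.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    return kinds.get(prefix, "unknown"), bare_id


print(parent_kind("t3_121ofyl"))  # ('submission', '121ofyl')
print(parent_kind("t1_jdmm239"))  # ('comment', 'jdmm239')
```

The bare ID of a `t3_` parent matches the `id` column of the submissions data, so it can serve as a join key between the two datasets.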

In [2]:
import praw
from dotenv import dotenv_values
import pandas as pd

# retrieve dotenv config
config = dotenv_values(".env")

reddit = praw.Reddit(
    client_id=config['CLIENT_ID'],
    client_secret=config['CLIENT_SECRET'],
    user_agent=config['USER_AGENT'],
)
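
If a key is missing from `.env`, the cell above fails with a bare `KeyError`. A small check like the following could give a clearer message before the client is built. This is only a sketch: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the plain dicts stand in for what `dotenv_values` returns.

```python
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")


def missing_keys(config):
    """List required settings that are absent or empty in the config mapping."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values only; real credentials belong in .env.
complete = {"CLIENT_ID": "x", "CLIENT_SECRET": "y", "USER_AGENT": "z"}
partial = {"CLIENT_ID": "x", "USER_AGENT": ""}

print(missing_keys(complete))  # []
print(missing_keys(partial))   # ['CLIENT_SECRET', 'USER_AGENT']
```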

In [3]:
def get_posts(generator):
    """Convert a listing of submissions into a list of flat dicts."""
    res_list = []
    for post in generator:
        sub = {
            'id': post.id,
            'title': post.title,
            'text': post.selftext,
            'subreddit': post.subreddit.display_name,
            # author is None when the API returns no account for the post
            'author': post.author.name if post.author is not None else None,
            'upvotes': post.ups,
            'downvotes': post.downs,
            'upvote_ratio': post.upvote_ratio,
            'score': post.score,
            'created_utc': post.created_utc,
            'url': 'reddit.com' + post.permalink,
            'link': post.url,
            'tags': post.link_flair_richtext,
            'num_comments': post.num_comments
        }
        res_list.append(sub)

    return res_list
        
def get_comments(generator):
    """Convert a listing of comments into a list of flat dicts."""
    res_list = []
    for comment in generator:
        cmt = {
            'id': comment.id,
            'text': comment.body,
            'subreddit': comment.subreddit.display_name,
            # author is None when the API returns no account for the comment
            'author': comment.author.name if comment.author is not None else None,
            'upvotes': comment.ups,
            'downvotes': comment.downs,
            'score': comment.score,
            'created_utc': comment.created_utc,
            'url': comment.link_permalink,
            'link': comment.link_url,
            'parent_id': comment.parent_id
        }
        res_list.append(cmt)

    return res_list
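
Both helpers keep `created_utc` exactly as the API returns it, a Unix timestamp in seconds. If readable dates are wanted later, one option (a sketch using only the standard library; `to_utc_datetime` is a hypothetical helper) is:

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


print(to_utc_datetime(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

Keeping the result timezone-aware avoids the timestamps being silently interpreted in the local timezone of whoever runs the notebook.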

In [4]:
# generator objects (lazy: each one can be iterated only once)
elonmusk = reddit.subreddit('elonmusk')
hot = elonmusk.hot(limit=1000)
new = elonmusk.new(limit=1000)
top = elonmusk.top(limit=1000)
# search all subreddits for the phrase 'elon musk'
all_subreds = reddit.subreddit('all').search('elon musk', limit=5000)

comments = elonmusk.comments(limit=1000)

In [11]:
hot1 = reddit.subreddit('elonmusk').hot(limit=1000)
sub = next(hot1)

# view the raw attributes; let me know if there are other attributes we need
vars(sub)

{'comment_limit': 2048,
 'comment_sort': 'confidence',
 '_reddit': <praw.reddit.Reddit at 0x106aee510>,
 'approved_at_utc': None,
 'subreddit': Subreddit(display_name='elonmusk'),
 'selftext': '',
 'author_fullname': 't2_afjb8oz6',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Liftoff.54321.',
 'link_flair_richtext': [{'e': 'text', 't': 'SpaceX'}],
 'subreddit_name_prefixed': 'r/elonmusk',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': '',
 'downs': 0,
 'thumbnail_height': 78,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_12150g4',
 'quarantine': False,
 'link_flair_text_color': 'light',
 'upvote_ratio': 0.92,
 'author_flair_background_color': None,
 'ups': 331,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': {'reddit_video': {'bitrate_kbps': 2400,
   'fallback_url': 'https://v.redd.it/e72u906
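
The raw attribute dump is long and includes internals such as `_reddit`. As a rough sketch, the private entries could be filtered out before inspecting what is available. `public_attrs` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for a submission so the example runs offline.

```python
from types import SimpleNamespace


def public_attrs(obj):
    """Return an object's instance attributes, skipping names that start with '_'."""
    return {name: value for name, value in vars(obj).items() if not name.startswith("_")}


# Stand-in for a PRAW submission (assumption: attributes live in __dict__).
fake_submission = SimpleNamespace(_reddit=object(), title="Liftoff.54321.", ups=331)

print(sorted(public_attrs(fake_submission)))  # ['title', 'ups']
```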

In [5]:
# parse generator objects
hotlist = get_posts(hot)
newlist = get_posts(new)
toplist = get_posts(top)
all_sub_list = get_posts(all_subreds)

In [6]:
comment_list = get_comments(comments)

In [7]:
comment_df = pd.DataFrame(comment_list)
comment_df

|  | id | text | subreddit | author | upvotes | downvotes | score | created_utc | url | link | parent_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jdmrbax | Sounds like he’s controlling. I go to my offic... | elonmusk | mari815 | 1 | 0 | 1 | 1.679759e+09 | https://www.reddit.com/r/elonmusk/comments/121... | https://www.kumaonjagran.com/elon-musk-tells-t... | t3_121ofyl |
| 1 | jdmpl8k | Get their butts into work and out of those paj... | elonmusk | Drone_Boss | 1 | 0 | 1 | 1.679758e+09 | https://www.reddit.com/r/elonmusk/comments/121... | https://www.kumaonjagran.com/elon-musk-tells-t... | t3_121ofyl |
| 2 | jdmoyi0 | I did a trial. I went to the office for 5 days... | elonmusk | thethreat88 | 1 | 0 | 1 | 1.679758e+09 | https://www.reddit.com/r/elonmusk/comments/121... | https://www.kumaonjagran.com/elon-musk-tells-t... | t3_121ofyl |
| 3 | jdmn89u | Boss up | elonmusk | Zealousideal-Ice9173 | 1 | 0 | 1 | 1.679757e+09 | https://www.reddit.com/r/elonmusk/comments/121... | https://www.kumaonjagran.com/elon-musk-tells-t... | t1_jdmm239 |
| 4 | jdmm239 | "Boss says going to work not optional" we need... | elonmusk | Aquada_ | 1 | 0 | 1 | 1.679757e+09 | https://www.reddit.com/r/elonmusk/comments/121... | https://www.kumaonjagran.com/elon-musk-tells-t... | t3_121ofyl |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 654 | jcqzbcd | If you want to improve the medical system, the... | elonmusk | pyguy6 | 5 | 0 | 5 | 1.679177e+09 | https://www.reddit.com/r/elonmusk/comments/11t... | https://www.httnews.com/business/elon-musk-spa... | t1_jcqye7y |
| 655 | jcqye7y | Prescribed off lable. And kids and teenagers a... | elonmusk | Least777 | 0 | 0 | 0 | 1.679176e+09 | https://www.reddit.com/r/elonmusk/comments/11t... | https://www.httnews.com/business/elon-musk-spa... | t1_jcqvoyn |
| 656 | jcqxbke | She is spot on.mate | elonmusk | ZaroonKhan5 | 2 | 0 | 2 | 1.679176e+09 | https://www.reddit.com/r/elonmusk/comments/11t... | https://v.redd.it/jkw5p81147oa1 | t1_jcnlhw8 |
| 657 | jcqvoyn | Thing is puberty is also irreversible, which i... | elonmusk | pyguy6 | 5 | 0 | 5 | 1.679175e+09 | https://www.reddit.com/r/elonmusk/comments/11t... | https://www.httnews.com/business/elon-musk-spa... | t1_jcqufyl |


In [None]:
# comment_df.to_csv('Raw datasets/comments.csv')

In [8]:
combinedlist = []
combinedlist.extend(hotlist)
combinedlist.extend(newlist)
combinedlist.extend(toplist)
combinedlist.extend(all_sub_list)
len(combinedlist)

2672
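
The four listings overlap, so the 2,672 rows counted above include repeated submissions. Before deduplicating, it can be useful to see how much overlap there is. A minimal sketch on a list of dicts shaped like the output of `get_posts` (`duplicate_ids` is a hypothetical helper; the sample rows reuse IDs from the data for illustration only):

```python
from collections import Counter


def duplicate_ids(rows):
    """Map each id that appears more than once to its number of occurrences."""
    counts = Counter(row["id"] for row in rows)
    return {post_id: n for post_id, n in counts.items() if n > 1}


rows = [{"id": "12150g4"}, {"id": "121ofyl"}, {"id": "12150g4"}]
print(duplicate_ids(rows))  # {'12150g4': 2}
```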

In [9]:
df = pd.DataFrame(combinedlist)
df

|  | id | title | text | subreddit | author | upvotes | downvotes | upvote_ratio | score | created_utc | url | link | tags | num_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12150g4 | Liftoff.54321. |  | elonmusk | ZaroonKhan5 | 324 | 0 | 0.92 | 324 | 1.679705e+09 | reddit.com/r/elonmusk/comments/12150g4/liftoff... | https://v.redd.it/e72u906e8spa1 | [{'e': 'text', 't': 'SpaceX'}] | 18 |
| 1 | 121ofyl | Elon Musk tells Twitter employees "Office is n... |  | elonmusk | Express_Turn_5489 | 6 | 0 | 0.67 | 6 | 1.679756e+09 | reddit.com/r/elonmusk/comments/121ofyl/elon_mu... | https://www.kumaonjagran.com/elon-musk-tells-t... | [{'e': 'text', 't': 'Tweets'}] | 10 |
| 2 | 120n1qh | Musk denies multibillion investment in SpaceX ... |  | elonmusk | Alex_ZH1 | 349 | 0 | 0.92 | 349 | 1.679670e+09 | reddit.com/r/elonmusk/comments/120n1qh/musk_de... | https://www.quicktechnics.com/en/post/musk-den... | [{'e': 'text', 't': 'SpaceX'}] | 69 |
| 3 | 1215ef2 | Elon Musk believes Neuralink can help address ... |  | elonmusk | Pawnti | 6 | 0 | 0.61 | 6 | 1.679706e+09 | reddit.com/r/elonmusk/comments/1215ef2/elon_mu... | https://www.httnews.com/technology/eyi02xa6agb... | [{'e': 'text', 't': 'Tweets'}] | 12 |
| 4 | 11zpwfi | WHO Warns Of "Fake News" After Elon Musk Pande... |  | elonmusk | erinswider | 281 | 0 | 0.86 | 281 | 1.679590e+09 | reddit.com/r/elonmusk/comments/11zpwfi/who_war... | https://globenewsbulletin.com/world/who-warns-... | [{'e': 'text', 't': 'Tweets'}] | 165 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2667 | 11er3dn | Let’s see how his free speech pans out: China’... | Smart money’s on him shutting up fairly quickly | RealTwitterAccounts | defectivetrashdetect | 1057 | 0 | 0.97 | 1057 | 1.677636e+09 | reddit.com/r/RealTwitterAccounts/comments/11er... | https://www.cnbc.com/2023/02/28/chinas-ccp-war... | [{'e': 'text', 't': 'Political™'}] | 95 |
| 2668 | 120cvcv | Texans blast Elon Musk's Boring plan to dump w... |  | RealTesla | Zorkmid123 | 274 | 0 | 0.95 | 274 | 1.679643e+09 | reddit.com/r/RealTesla/comments/120cvcv/texans... | https://www.foxnews.com/us/texans-blast-elon-m... | [] | 51 |
| 2669 | 10efs58 | For Glass Onion, Rian Johnson convinced elon m... |  | shittymoviedetails | of_kilter | 16439 | 0 | 0.91 | 16439 | 1.673971e+09 | reddit.com/r/shittymoviedetails/comments/10efs... | https://i.redd.it/mrjmj1f34oca1.jpg | [] | 294 |
| 2670 | zsghfu | Elon Musk getting owned by a former Twitter en... |  | facepalm | eichenes | 10693 | 0 | 0.95 | 10693 | 1.671696e+09 | reddit.com/r/facepalm/comments/zsghfu/elon_mus... | https://v.redd.it/azr15yt2qe7a1 | [{'a': ':Misc:', 'e': 'emoji', 'u': 'https://e... | 1031 |


In [10]:
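# remove duplicates: keep the first row for each submission id
# (only the `id` column is compared, so the list-valued `tags` column never needs hashing)
df = df.drop_duplicates(subset='id')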
df

|  | id | title | text | subreddit | author | upvotes | downvotes | upvote_ratio | score | created_utc | url | link | tags | num_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12150g4 | Liftoff.54321. |  | elonmusk | ZaroonKhan5 | 324 | 0 | 0.92 | 324 | 1.679705e+09 | reddit.com/r/elonmusk/comments/12150g4/liftoff... | https://v.redd.it/e72u906e8spa1 | [{'e': 'text', 't': 'SpaceX'}] | 18 |
| 1 | 121ofyl | Elon Musk tells Twitter employees "Office is n... |  | elonmusk | Express_Turn_5489 | 6 | 0 | 0.67 | 6 | 1.679756e+09 | reddit.com/r/elonmusk/comments/121ofyl/elon_mu... | https://www.kumaonjagran.com/elon-musk-tells-t... | [{'e': 'text', 't': 'Tweets'}] | 10 |
| 2 | 120n1qh | Musk denies multibillion investment in SpaceX ... |  | elonmusk | Alex_ZH1 | 349 | 0 | 0.92 | 349 | 1.679670e+09 | reddit.com/r/elonmusk/comments/120n1qh/musk_de... | https://www.quicktechnics.com/en/post/musk-den... | [{'e': 'text', 't': 'SpaceX'}] | 69 |
| 3 | 1215ef2 | Elon Musk believes Neuralink can help address ... |  | elonmusk | Pawnti | 6 | 0 | 0.61 | 6 | 1.679706e+09 | reddit.com/r/elonmusk/comments/1215ef2/elon_mu... | https://www.httnews.com/technology/eyi02xa6agb... | [{'e': 'text', 't': 'Tweets'}] | 12 |
| 4 | 11zpwfi | WHO Warns Of "Fake News" After Elon Musk Pande... |  | elonmusk | erinswider | 281 | 0 | 0.86 | 281 | 1.679590e+09 | reddit.com/r/elonmusk/comments/11zpwfi/who_war... | https://globenewsbulletin.com/world/who-warns-... | [{'e': 'text', 't': 'Tweets'}] | 165 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2667 | 11er3dn | Let’s see how his free speech pans out: China’... | Smart money’s on him shutting up fairly quickly | RealTwitterAccounts | defectivetrashdetect | 1057 | 0 | 0.97 | 1057 | 1.677636e+09 | reddit.com/r/RealTwitterAccounts/comments/11er... | https://www.cnbc.com/2023/02/28/chinas-ccp-war... | [{'e': 'text', 't': 'Political™'}] | 95 |
| 2668 | 120cvcv | Texans blast Elon Musk's Boring plan to dump w... |  | RealTesla | Zorkmid123 | 274 | 0 | 0.95 | 274 | 1.679643e+09 | reddit.com/r/RealTesla/comments/120cvcv/texans... | https://www.foxnews.com/us/texans-blast-elon-m... | [] | 51 |
| 2669 | 10efs58 | For Glass Onion, Rian Johnson convinced elon m... |  | shittymoviedetails | of_kilter | 16439 | 0 | 0.91 | 16439 | 1.673971e+09 | reddit.com/r/shittymoviedetails/comments/10efs... | https://i.redd.it/mrjmj1f34oca1.jpg | [] | 294 |
| 2670 | zsghfu | Elon Musk getting owned by a former Twitter en... |  | facepalm | eichenes | 10693 | 0 | 0.95 | 10693 | 1.671696e+09 | reddit.com/r/facepalm/comments/zsghfu/elon_mus... | https://v.redd.it/azr15yt2qe7a1 | [{'a': ':Misc:', 'e': 'emoji', 'u': 'https://e... | 1031 |


In [None]:
# df.to_csv('Raw datasets/submissions.csv')