<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 3.2.2
# *Mining Social Media on Reddit*

## The Reddit API and the PRAW Package

The Reddit API is rich and complex, with many endpoints (https://www.reddit.com/dev/api/). It includes methods for navigating its collections, which include various kinds of media as well as comments. Fortunately, the Python library PRAW reduces much of this complexity.

Reddit requires developers to create and authenticate an app before they can use the API, but the process is much less onerous than some, and there is no waiting period for approval of new developers.

### 1. Create a Reddit App

Go to https://www.reddit.com/prefs/apps and click "create an app".

Enter the following in the form:

- a name for your app
- select "script" radio button
- a description
- a redirect URI

(N.B. When pulling data into a data science experiment, a local port can be used for the redirect URI; try http://127.0.0.1:1410)


- click "create app"
- from the form that displays, copy the following to a local text file (or to this notebook):

  - name (the name you gave to your app)
  - redirect URI
  - personal use script (this is your OAuth 2 Client ID)
  - secret (this is your OAuth 2 Secret)

### 2. Register for API Access

- follow the link at https://www.reddit.com/wiki/api and read the terms of use for Reddit API access
- fill in the form fields at the bottom
  - make sure to enter your new OAuth Client ID where indicated
  - your use case could be something like "Training in API usage for data science projects"
  - your platform could be something like "Jupyter Notebooks / Python"
  
- click "SUBMIT"

- when asked for User-Agent, enter something that fits this pattern:
  `your_os-python:your_reddit_appname:v1.0 (by /u/your_reddit_username)`
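
The pattern above can also be assembled in code, which helps keep the string consistent between notebooks. This is only an illustrative sketch: `build_user_agent` and its parameter names are hypothetical (they are not part of PRAW), and the values passed in are placeholders.

```python
def build_user_agent(platform, app_name, version, username):
    """Assemble a user-agent string in the pattern shown above (hypothetical helper)."""
    return f"{platform}:{app_name}:v{version} (by /u/{username})"


# Placeholder values; substitute your own details.
user_agent = build_user_agent("your_os-python", "your_reddit_appname", "1.0", "your_reddit_username")
print(user_agent)
```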

### 3. Load Python Libraries

In [None]:
# !pip install praw

In [1]:
import praw
import requests
import json
import pprint
from datetime import datetime, date, time

### 4. Authenticate from your Python script

You could assign your authentication details explicitly, as follows:

In [None]:
my_user_agent = 'python:praw practice:v1.0 (by /u/GrapefruitKey6917)'    # your user Agent string goes in here
my_client_id = ''   # your Client ID string goes in here
my_client_secret = ''   # your Secret string goes in here



A better way would be to store these details externally, so they are not displayed in the notebook:

- create a file called "auth_reddit.json" in your "notebooks" directory, and save your credentials there in JSON format:

`{   "my_client_id": "your Client ID string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;` "my_client_secret": "your Secret string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"my_user_agent": "your user Agent string goes in here"` <br>
`}`

- Double quotes only ( `"` not `'` )
- commas after first two items but not the last

Use the following code to load the credentials:  

In [2]:
pwd()  # make sure your working directory is where the file is

'c:\\IOD\\Module_3'

In [8]:
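path_auth = 'auth_reddit.json'
with open(path_auth) as f:   # the context manager closes the file after reading
    auth = json.load(f)
pp = pprint.PrettyPrinter(indent=4)

# For debugging only (this prints your credentials into the notebook output):
pp.pprint(auth)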

{   'my_client_id': '<redacted>',
    'my_client_secret': '<redacted>',
    'my_user_agent': 'python:praw practice:v1.0 (by /u/GrapefruitKey6917)'}


In [9]:
my_user_agent = auth['my_user_agent']
my_client_id = auth['my_client_id']
my_client_secret = auth['my_client_secret']

Security considerations:
- this method only keeps your credentials hidden if nobody else can read `auth_reddit.json` and the notebook is saved without any output that prints them (e.g. from the debugging cell above)
- if you want another user to run the notebook without seeing your credentials, you should set up an OAuth 2.0 workflow, in which each user authorizes your app with their own Reddit account and obtains their own API tokens
- if you just want to share your analyses, you could use a separate script (which you don't share) to fetch the data and save it locally, then use a second notebook (with no API access) to load and analyze the locally stored data
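
The last option, splitting fetching from analysis, can be sketched as follows. This is a rough illustration rather than a prescribed workflow: `save_records` and `load_records` are hypothetical helper names, the file name is arbitrary, and the record shown is a placeholder standing in for data that the private script would fetch.

```python
import json
import os
import tempfile


def save_records(records, path):
    """Write a list of plain dicts to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back; this step needs no credentials or API access."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder record standing in for data fetched by the private script.
records = [
    {"id": "abc123", "title": "placeholder title", "author": "placeholder_user"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "submissions.json")
    save_records(records, data_path)   # done in the private fetch script
    loaded = load_records(data_path)   # done in the shareable analysis notebook

print(loaded == records)
```

Only plain values (strings, numbers, lists, dicts) survive the round trip, so PRAW objects would need to be reduced to such values before saving.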

### 5. Exploring the API

Here is how to connect to Reddit with read-only access:

In [10]:
reddit = praw.Reddit(client_id = my_client_id,
                     client_secret = my_client_secret,
                     user_agent = my_user_agent)

print('Read-only = ' + str(reddit.read_only))  # Output: True

Read-only = True


In the next cell, put the cursor after the '.' and hit the [tab] key to see the available members and methods in the response object:

In [None]:
# reddit.


Tab completion did not work here, but `dir()` did:

In [19]:
dir(reddit)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_authorized_core',
 '_check_for_async',
 '_check_for_update',
 '_core',
 '_handle_rate_limit',
 '_next_unique',
 '_objectify_request',
 '_objector',
 '_prepare_common_authorizer',
 '_prepare_objector',
 '_prepare_prawcore',
 '_prepare_trusted_prawcore',
 '_prepare_untrusted_prawcore',
 '_ratelimit_regex',
 '_read_only_core',
 '_token_manager',
 '_unique_counter',
 '_validate_on_submit',
 'auth',
 'comment',
 'config',
 'delete',
 'domain',
 'drafts',
 'front',
 'get',
 'inbox',
 'info',
 'live',
 'multireddit',
 'notes',
 'patch',
 'post',
 'put',
 'random_subreddit',
 'read_only',
 'redditor',

In [13]:
subreddit_name = 'malaysia'
subreddit = reddit.subreddit(subreddit_name)

In [14]:
comments = []
for comment in subreddit.comments(limit=1000):
    comments.append(comment)

In [15]:
reddit.subreddit(subreddit_name)

Subreddit(display_name='malaysia')

In [18]:
dir(subreddit)

['MESSAGE_PREFIX',
 'STR_FIELD',
 'VALID_TIME_FILTERS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_convert_to_fancypants',
 '_create_or_update',
 '_fetch',
 '_fetch_data',
 '_fetch_info',
 '_fetched',
 '_kind',
 '_parse_xml_response',
 '_path',
 '_prepare',
 '_read_and_post_media',
 '_reddit',
 '_reset_attributes',
 '_safely_add_arguments',
 '_submission_class',
 '_submit_media',
 '_subreddit_collections_class',
 '_subreddit_list',
 '_upload_inline_media',
 '_upload_media',
 '_url_parts',
 '_validate_gallery',
 '_validate_inline_media',
 '_validate_time_filter',
 'banned',
 'collections',
 'comments',
 'contributor',
 'controversial',
 'display_n

In [23]:
# for item in dir(subreddit):
#     if not item.startswith("_"):
#         print(item)

of_interest = [item for item in dir(subreddit) if not item.startswith('_') ]
len(of_interest) # 144
print(of_interest)

['MESSAGE_PREFIX', 'STR_FIELD', 'VALID_TIME_FILTERS', 'accept_followers', 'accounts_active', 'accounts_active_is_fuzzed', 'active_user_count', 'advertiser_category', 'all_original_content', 'allow_discovery', 'allow_galleries', 'allow_images', 'allow_polls', 'allow_prediction_contributors', 'allow_predictions', 'allow_predictions_tournament', 'allow_talks', 'allow_videogifs', 'allow_videos', 'allowed_media_in_comments', 'banned', 'banner_background_color', 'banner_background_image', 'banner_img', 'banner_size', 'can_assign_link_flair', 'can_assign_user_flair', 'collapse_deleted_comments', 'collections', 'comment_contribution_settings', 'comment_score_hide_mins', 'comments', 'community_icon', 'community_reviewed', 'contributor', 'controversial', 'created', 'created_utc', 'description', 'description_html', 'disable_contributor_requests', 'display_name', 'display_name_prefixed', 'emoji', 'emojis_custom_size', 'emojis_enabled', 'filters', 'flair', 'free_form_reports', 'fullname', 'gilded',

In [17]:
top_posts = subreddit.top(limit=5)

Consult the PRAW and Reddit API documentation. Print a few of the response members below:

In [None]:
"""
comment
random_subreddit
submission
subreddit
user
"""

Content in Reddit is grouped by topics called "subreddits". Content, called "submissions" *[aka "posts"]*, is fetched by calling the `subreddit` method of the connection object (which is our `reddit` variable) with an argument that matches an actual topic.

We also need to append *[chain]* a further method call to a "subinstance", such as one of the following:

- controversial
- gilded
- hot
- new
- rising
- top
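
Because these listings are ordinary methods on the subreddit object, one of them can be selected by name with `getattr`. The sketch below is illustrative only: `fetch_titles` is a hypothetical helper, and the demo uses a stand-in object with placeholder titles so that it runs without a Reddit connection.

```python
from types import SimpleNamespace

# Listing names taken from the list above
LISTINGS = {"controversial", "gilded", "hot", "new", "rising", "top"}


def fetch_titles(subreddit, listing="hot", limit=10):
    """Return submission titles from the named listing (hypothetical helper)."""
    if listing not in LISTINGS:
        raise ValueError(f"unknown listing: {listing!r}")
    listing_method = getattr(subreddit, listing)  # e.g. subreddit.hot
    return [submission.title for submission in listing_method(limit=limit)]


# Stand-in for a subreddit object, with placeholder titles (no API call is made).
fake_subreddit = SimpleNamespace(
    new=lambda limit: [SimpleNamespace(title=f"placeholder title {i}") for i in range(limit)]
)
titles = fetch_titles(fake_subreddit, listing="new", limit=3)
print(titles)
```

With a live connection, the same helper could be called as `fetch_titles(reddit.subreddit('learnpython'), 'new')`.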

One of the submission object's members is `title`. Fetch and print 10 submission titles from the 'learnpython' subreddit using one of the subinstances above:

In [24]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.title)

Ask Anything Monday - Weekly Thread
Learn to code 
Changing a set with -ve numbers to list
Pyhthon PCAP & PCEP training  in Athens, Greece
Career crisis involving Python 
Problem iterating over a list of namedtuples
Skipping "Before you continue to youtube" message
Multiprocessing slowing down with more process
Need Help Automating a Website - Issues with Selenium and PyAutoGUI
A Python Programming Roadmap


Now retrieve 10 authors:

In [25]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.author)

AutoModerator
dumdum101704
No_Prize_120
No_Lingonberry1481
rogue_lash
steakhutzeee
ST4N4R
TrojanTrash
yeah280
bbroy4u


Note that we obtained the titles and authors from separate API calls. Can we expect these to correspond to the same submissions? If not, how could we guarantee that they do?

In [28]:
submissions=reddit.subreddit('learnpython')
for post in submissions.hot(limit=10):
    print(f"{post.title} ({post.author})")

Ask Anything Monday - Weekly Thread (AutoModerator)
Learn to code  (dumdum101704)
Changing a set with -ve numbers to list (No_Prize_120)
Pyhthon PCAP & PCEP training  in Athens, Greece (No_Lingonberry1481)
Career crisis involving Python  (rogue_lash)
Problem iterating over a list of namedtuples (steakhutzeee)
Skipping "Before you continue to youtube" message (ST4N4R)
Multiprocessing slowing down with more process (TrojanTrash)
Need Help Automating a Website - Issues with Selenium and PyAutoGUI (yeah280)
A Python Programming Roadmap (bbroy4u)
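
Reading every field from the same submission object in a single pass is what keeps titles and authors aligned. A minimal sketch of that idea, storing each submission as one record: `submission_records` is a hypothetical helper, and the demo uses stand-in objects with placeholder values rather than live data. It assumes `author` may be `None` (as PRAW reports for deleted accounts).

```python
from types import SimpleNamespace


def submission_records(submissions):
    """Collect id, title and author from each submission in one pass (hypothetical helper)."""
    records = []
    for submission in submissions:
        author = submission.author
        records.append({
            "id": submission.id,
            "title": submission.title,
            # author can be None when the account has been deleted
            "author": str(author) if author is not None else None,
        })
    return records


# Stand-in submissions with placeholder values (no API call is made).
fake_submissions = [
    SimpleNamespace(id="id1", title="placeholder title 1", author="placeholder_user"),
    SimpleNamespace(id="id2", title="placeholder title 2", author=None),
]
records = submission_records(fake_submissions)
print(records)
```

With a live connection, the argument could be `reddit.subreddit('learnpython').hot(limit=10)`.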


Why doesn't the next cell produce output?

In [None]:
for submission in submissions:
    print(submission.comments)

TypeError: ignored

It doesn't produce output because `submissions`, as defined above, is the subreddit object itself rather than a listing of posts, and a subreddit object is not iterable (hence the `TypeError`). You have to go one level deeper to get the posts, e.g. `submissions.hot()`.

Print two comments associated with each of these submissions:

In [29]:
submissions = reddit.subreddit('learnpython').hot(limit=10)
for submission in submissions:
    # Drop "load more comments" placeholders, which have no .body attribute
    submission.comments.replace_more(limit=0)
    # .list() flattens the comment tree; keep the first two comments
    for comment in submission.comments.list()[:2]:
        print(comment.body)

Impact of AI
Can someone help me for creating the code for Finite Difference Method
Makes sense because that was me about a couple of weeks back. Total 100% honesty here:

I picked a place (codechef) and just launched into it and have been going steady since. There are days where I do not feel like doing any coding, THAT IS WHEN YOU PUSH and do it anyway. Repeat lessons if you have to, but don't ever skip a day, always be making progress, even .01% is moving forward.

Do not amass a bunch of resources, I did that and I ended up sitting on a tresure trove of courses for 5 years and wasted valuable time seeking new advice without actually needing it.

I apologize if I sound curt or judgy, I am just trying to help people not repeat what I did in the beginning
Get enough sleep, learn in the morning when you're fresh and rested, get coffee.
Sets are not ordered. If this ever works, it's only by coincidence.

If you need to sort, you should do so explicitly:

    list(sorted(set(my_list)))

Referring to the API documentation, explore the submissions object and print some interesting data:

**Interesting data:** (Some) methods that return posts (mostly as ListingGenerators):
- `controversial()`
- `gilded()`
- `hot()`
- `new()`
- `rising()` and `random_rising()`
- `top()`
- `random()` (returns a single random submission rather than a listing)


#### Posting to Reddit

To be able to post to your Reddit account (i.e. contribute submissions), you need to connect to the API with read/write privilege. This requires an *authorized instance*, which is obtained by including your Reddit user name and password in the connection request:

In [None]:
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent',
                     username='my username',
                     password='my password')
print(reddit.read_only)  # Output: False

False


You could hide these last two credentials by adding them to your JSON file and then reading all five values at once.
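
Reading all five values at once could look like the sketch below. It assumes the JSON file follows the same `my_` key convention as before, extended with two hypothetical keys, `my_username` and `my_password`; `load_reddit_kwargs` is also a hypothetical helper. The demo writes placeholder values to a temporary file and makes no connection to Reddit.

```python
import json
import os
import tempfile


def load_reddit_kwargs(path):
    """Load credentials and strip the 'my_' prefix from each key (hypothetical helper)."""
    with open(path) as f:
        auth = json.load(f)
    prefix = "my_"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in auth.items()
    }


# Placeholder credentials; the last two keys are assumed additions to the file.
placeholder_auth = {
    "my_client_id": "placeholder-id",
    "my_client_secret": "placeholder-secret",
    "my_user_agent": "placeholder-agent",
    "my_username": "placeholder-username",
    "my_password": "placeholder-password",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "auth_reddit.json")
    with open(demo_path, "w") as f:
        json.dump(placeholder_auth, f)
    reddit_kwargs = load_reddit_kwargs(demo_path)

print(sorted(reddit_kwargs))
```

The resulting keys match the keyword arguments used in the cell above, so with real credentials the connection could be opened with `praw.Reddit(**load_reddit_kwargs('auth_reddit.json'))`.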

---



> > > > > > > > > © 2024 Institute of Data


---


