<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 3.2.2
# *Mining Social Media on Reddit*

## The Reddit API and the PRAW Package

The Reddit API is rich and complex, with many endpoints (https://www.reddit.com/dev/api/). It includes methods for navigating its collections, which include various kinds of media as well as comments. Fortunately, the Python library PRAW reduces much of this complexity.

Reddit requires developers to create and authenticate an app before they can use the API, but the process is much less onerous than some, and there is no waiting period for approval of new developers.

### 1. Create a Reddit App

Go to https://www.reddit.com/prefs/apps and click "create an app".

Enter the following in the form:

- a name for your app
- select the "script" radio button
- a description
- a redirect URI

(N.B. For pulling data into a data science experiment, a local port can be used for the redirect URI; try http://127.0.0.1:1410)


- click "create app"
- from the form that displays, copy the following to a local text file (or to this notebook):

  - name (the name you gave to your app)
  - redirect URI
  - personal use script (this is your OAuth 2 Client ID)
  - secret (this is your OAuth 2 Secret)

### 2. Register for API Access

- follow the link at https://www.reddit.com/wiki/api and read the terms of use for Reddit API access
- fill in the form fields at the bottom
  - make sure to enter your new OAuth Client ID where indicated
  - your use case could be something like "Training in API usage for data science projects"
  - your platform could be something like "Jupyter Notebooks / Python"
  
- click "SUBMIT"

- when asked for User-Agent, enter something that fits this pattern:
  `your_os-python:your_reddit_appname:v1.0 (by /u/your_reddit_username)`
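
The User-Agent pattern above can also be assembled in code rather than typed by hand. The following is a minimal sketch; `build_user_agent` and its parameter names are hypothetical helpers introduced here for illustration and are not part of PRAW or the Reddit API.

```python
import platform


def build_user_agent(app_name, reddit_username, version="v1.0", os_name=None):
    """Return a string shaped like 'os-python:app_name:v1.0 (by /u/username)'."""
    # Assumption: when no OS name is given, fall back to the local system name
    # (for example 'linux', 'darwin' or 'windows').
    if os_name is None:
        os_name = platform.system().lower() or "unknown"
    return f"{os_name}-python:{app_name}:{version} (by /u/{reddit_username})"


print(build_user_agent("my_app", "my_username", os_name="macos"))
# macos-python:my_app:v1.0 (by /u/my_username)
```

The resulting string is what you would store as `my_user_agent` in the later steps.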

### 3. Load Python Libraries

In [2]:
!pip3 install --upgrade pip
!pip3 install praw

Collecting pip
  Downloading pip-24.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.0
    Uninstalling pip-24.0:
      Successfully uninstalled pip-24.0
Successfully installed pip-24.1


In [3]:
import praw
import requests
import json
import pprint
from datetime import datetime, date, time

### 4. Authenticate from your Python script

You could assign your authentication details explicitly, as follows:

In [4]:
my_user_agent = 'ReflectionOk187'   # your user Agent string goes in here
my_client_id = 'NVBSkUAgnS5bh_kSaMIjLg'   # your Client ID string goes in here
my_client_secret = 'XxacRTP9NLreAu5dYWM3I_KImyud0A'   # your Secret string goes in here

A better way would be to store these details externally, so they are not displayed in the notebook:

- create a file called "auth_reddit.json" in your "notebooks" directory, and save your credentials there in JSON format:

`{   "my_client_id": "your Client ID string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;` "my_client_secret": "your Secret string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"my_user_agent": "your user Agent string goes in here"` <br>
`}`

Use the following code to load the credentials:  

In [5]:
pwd()  # make sure your working directory is where the file is

'/Users/pnandan/Documents/Data-Science-Project/Module-3'

In [7]:
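path_auth = 'auth_reddit.json'
with open(path_auth) as f:   # the context manager closes the file afterwards
    auth = json.load(f)
pp = pprint.PrettyPrinter(indent=4)
# For debugging only (this prints your secrets, so clear the output before sharing):
pp.pprint(auth)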




{   'my_client_id': 'NVBSkUAgnS5bh_kSaMIjLg',
    'my_client_secret': 'XxacRTP9NLreAu5dYWM3I_KImyud0A',
    'my_user_agent': 'ReflectionOk187'}


In [8]:
my_user_agent = auth['my_user_agent']
my_client_id = auth['my_client_id']
my_client_secret = auth['my_client_secret']
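
If a key is missing from the JSON file, the lookups above fail with a bare `KeyError` that does not say which file is at fault. One possible way to fail earlier with a clearer message is sketched below; `load_reddit_auth` and `REQUIRED_KEYS` are hypothetical names, and the key names are assumed to match the file layout described above.

```python
import json

# Assumption: these are the three keys stored in auth_reddit.json
REQUIRED_KEYS = ("my_client_id", "my_client_secret", "my_user_agent")


def load_reddit_auth(path="auth_reddit.json"):
    """Load the credentials file and fail early if a key is missing or empty."""
    with open(path) as f:
        auth = json.load(f)
    # Collect every required key that is absent or has an empty value
    missing = [key for key in REQUIRED_KEYS if not auth.get(key)]
    if missing:
        raise KeyError(f"Missing or empty keys in {path}: {', '.join(missing)}")
    return auth
```

Calling `auth = load_reddit_auth()` would then stand in for the manual `json` loading step.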

Security considerations:
- this method only keeps your credentials invisible as long as nobody else gets access to this notebook file
- if you wanted another user to have access to the executable notebook without divulging your credentials you should set up an OAuth 2.0 workflow to let them obtain and apply their own API tokens when using your app
- if you just want to share your analyses, you could use a separate script (which you don't share) to fetch the data and save it locally, then use a second notebook (with no API access) to load and analyse the locally stored data
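
The last option (fetch in one script, analyse in another) might look roughly like the sketch below. It assumes comment-like objects exposing `id`, `author`, `body` and `score` attributes; the function names and the file name `comments.json` are hypothetical.

```python
import json


def comments_to_records(comments):
    """Keep only plain, JSON-serialisable fields from comment-like objects."""
    return [
        {
            "id": c.id,
            # The author is an object (or None for deleted accounts), so store its text form
            "author": str(c.author) if c.author is not None else None,
            "body": c.body,
            "score": c.score,
        }
        for c in comments
    ]


def save_records(records, path):
    """Write the records to a local JSON file (run this in the private fetching script)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back (run this in the shareable notebook, which needs no credentials)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The fetching script would call `save_records(comments_to_records(comments), "comments.json")`, and the shared notebook would only ever call `load_records("comments.json")`.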

### 5. Exploring the API

Here is how to connect to Reddit with read-only access:

In [12]:
reddit = praw.Reddit(client_id = my_client_id,
                     client_secret = my_client_secret,
                     user_agent = my_user_agent)

print('Read-only = ' + str(reddit.read_only))  # Output: True

Read-only = True


In the next cell, put the cursor after the '.' and hit the [tab] key to see the available members and methods in the response object:

In [11]:
subreddit_name = 'malaysia'
subreddit = reddit.subreddit(subreddit_name)

In [13]:
comments = []
for comment in subreddit.comments(limit=1000):
    comments.append(comment)

In [14]:
print(len(comments))

918


In [16]:
reddit.subreddit(subreddit_name)

Subreddit(display_name='malaysia')

Consult the PRAW and Reddit API documentation. Print a few of the response members below:

In [17]:
for comment in comments:
    print(f"Comment ID: {comment.id}")
    print(f"Author: {comment.author}")
    print(f"Body: {comment.body}")
    print(f"Score: {comment.score}")
    print("-" * 20)

Comment ID: l9vemkb
Author: ProbablyWorking
Body: Aigh't bro. Just stating my answer.
Score: 1
--------------------
Comment ID: l9vel60
Author: AlanDevonshire
Body: The shit that passes for a story in Malaysia never ceases to amaze me. 
Does your country have so little going on that soup and words that look like other words is worth all the fuss?
Score: 1
--------------------
Comment ID: l9vefq7
Author: GreeneValley
Body: For me, it’s the ecosystem (iPhone, iPad, Mac, Apple TV, HomePods, AirPods, Apple Watch.. soon, Vision Pro) that seamlessly work with each other. Some examples:

- Universal Control: Control multiple Macs/iPads with a keyboard and mouse/trackpad, can even drag n’ drop files between devices.
- (Cont.) use iPad as external displays
- Hand-off/Continuity: Continue what I was surfing/working on/listening to/reading/playing on another device
- (cont.) Take calls/reply messages on any devices
- AirPods (earphones/headphones) that seamlessly switch to whichever devices you’r

Content in Reddit is grouped by topics called "subreddits". Content, called "submissions", is fetched by calling the `subreddit` method of the connection object (which is our `reddit` variable) with an argument that matches an actual topic.

We also need to append a further method call to a "subinstance", such as one of the following:

- controversial
- gilded
- hot
- new
- rising
- top
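
Because each of these is a method on the subreddit object, the listing can also be chosen by name at run time. This is a small sketch, assuming every name in the list above is available as a method that accepts a `limit` argument in your installed PRAW version; `fetch_listing` and `LISTINGS` are hypothetical names.

```python
LISTINGS = ("controversial", "gilded", "hot", "new", "rising", "top")


def fetch_listing(subreddit, listing="hot", limit=10):
    """Call the named listing method on a subreddit object and return the results as a list."""
    if listing not in LISTINGS:
        raise ValueError(f"Unknown listing {listing!r}; expected one of {LISTINGS}")
    # getattr looks the method up by name, e.g. subreddit.hot
    return list(getattr(subreddit, listing)(limit=limit))
```

For example, `fetch_listing(reddit.subreddit('learnpython'), 'new', limit=10)` would be expected to behave like `list(reddit.subreddit('learnpython').new(limit=10))`.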

One of the submission object's members is `title`. Fetch and print 10 submission titles from the 'learnpython' subreddit using one of the subinstances above:

In [18]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.title)

Ask Anything Monday - Weekly Thread
Do Any Of You Feel The Fear Of forgetting when you're just starting out 
Learning Python
How to locate web elements using selenium to later use it in pyautogui.
Planning a program
My first solo project. (Looking for feedback)
Should I return None for an empty data array?
Python Classes and inheritance
Is there a simpler way to do this?
Advice on where to go from here


Now retrieve 10 authors:

In [19]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.author)

AutoModerator
Weekly_Event_1969
Low_Mathematician571
Loud_Fisherman1862
DeanMalHanNJackIsms
MildlyAngryGoose
Ok-Frosting7364
Cocoatea57
math-nerd42
Choice_Shoulder2828


Note that we obtained the titles and authors from separate API calls. Can we expect these to correspond to the same submissions? If not, how could we guarantee that they do?
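
One possible answer, offered as a sketch rather than the only approach: a "hot" listing can change between two requests, so reading both fields from the same submission object in a single pass keeps them aligned. `titles_and_authors` is a hypothetical helper name.

```python
def titles_and_authors(submissions):
    """Read title and author from the same submission object in one pass."""
    # Each pair comes from a single object, so title and author cannot drift apart
    return [(s.title, str(s.author)) for s in submissions]


# Example call (needs an authenticated `reddit` instance, so it is left commented out):
# pairs = titles_and_authors(reddit.subreddit('learnpython').hot(limit=10))
# for title, author in pairs:
#     print(f"{author}: {title}")
```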

In [21]:
submissions=reddit.subreddit('learnpython')

Why doesn't the next cell produce output?

In [22]:
submissions = subreddit.hot(limit=10)

# Print the comments of each submission
for submission in submissions:
    print(f"Submission title: {submission.title}")
    submission.comments.replace_more(limit=0)  # Replace "more comments" with actual comments
    for comment in submission.comments.list():
        print(f"Comment by {comment.author}: {comment.body}")
        print("-" * 20)
    print("=" * 40)

Submission title: /r/Malaysia daily random discussion and quick questions thread for 23 June 2024
Comment by AutoModerator: 
**Minor announcements:**

* [monyet.cc](https://monyet.cc): Check out our Malaysian Lemmy community! ([why?](https://www.reddit.com/r/malaysia/comments/14cmnaj/rmalaysia_and_the_blackout/))
* [SPM Megathread](https://www.reddit.com/r/malaysia/comments/s58t8m/spm_megathread/): Updated 2022 with SPM resources such as trial papers, modules, notes and more!
* [Mental health wiki](https://www.reddit.com/r/malaysia/wiki/mental_health/): A list of mental health services in Malaysia
   


*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/malaysia) if you have any questions or concerns.*
--------------------
Comment by SoulZoul: yo can y'all recommend me a good budget wireless gaming mouse under rm200.

mostly playing FPS games like val and csgo so quick and convenient dpi switching is impor

Print two comments associated with each of these submissions:

In [23]:
submissions = reddit.subreddit('learnpython').hot(limit=10)
for submission in submissions:
    # Drop "load more comments" placeholders, which have no body attribute
    submission.comments.replace_more(limit=0)
    # Take the first two comments from the flattened comment tree
    for comment in submission.comments.list()[:2]:
        print(comment.body)

How come the code:
print("0123456789"[::2])
Prints 02468, meaning prints the first character and then in jumps of two.
But the code:
print("0123456789"[::-1])
Prints 9876543210, meaning jumps -1 characters first and only then prints?
Am I understanding something wrong here?
\[Pandas\] why do i have to put the ".index" in this line (data.drop(data\[data\['Weight'\] > 4000\].index, inplace=True) ) to make it work? I was trying to remove all rows with a Weight value that was bigger than 4000
It's a prt of learning and is fine. In this day and age of the internet if you forget it you can look it up again and as you practice using it you will have to look it up less and less often
I dont remember crap lol. it comes with experience, you remember the things you actually need eventually. Dont stress about it, just start doing and you'll see everything will sort it self out.
I like python crash course as an introduction. Teaches you the basics while working through a few different kinds of proj

Referring to the API documentation, explore the submissions object and print some interesting data:

In [25]:
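# A listing is a generator: once an earlier loop has consumed it, it yields nothing more, so fetch it again
submissions = reddit.subreddit('learnpython').hot(limit=10)
for submission in submissions: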
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Score: {submission.score}")
    print(f"ID: {submission.id}")
    print(f"URL: {submission.url}")
    print(f"Created: {submission.created_utc}")
    print(f"Number of Comments: {submission.num_comments}")
    print(f"Subreddit: {submission.subreddit}")
    print(f"Flair: {submission.link_flair_text}")
    print(f"Upvote Ratio: {submission.upvote_ratio}")
    print(f"Is Stickied: {submission.stickied}")
    print(f"Is NSFW: {submission.over_18}")
    print("=" * 40)
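
The `created_utc` member is a Unix timestamp (seconds since 1970-01-01 UTC), which is hard to read as a raw float. A minimal sketch of a conversion using the standard library follows; `utc_timestamp_to_text` is a hypothetical helper name, and the sample value is one that appears in this notebook's output.

```python
from datetime import datetime, timezone


def utc_timestamp_to_text(created_utc):
    """Convert seconds since the Unix epoch into a readable UTC date string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


print(utc_timestamp_to_text(1718582429.0))
# 2024-06-17 00:00:29 UTC
```

Inside the loop this could be used as `print(f"Created: {utc_timestamp_to_text(submission.created_utc)}")`.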

#### Posting to Reddit

To be able to post to your Reddit account (i.e. contribute submissions), you need to connect to the API with read/write privilege. This requires an *authorised instance*, which is obtained by including your Reddit user name and password in the connection request:

In [26]:
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent',
                     username='my username',
                     password='my password')
print(reddit.read_only)  # Output: False

False


You could hide these last two credentials by adding them to your JSON file and then reading all five values at once.

In [30]:
with open('auth_reddit.json') as f:
    creds = json.load(f)

# Initialize the Reddit instance using the credentials
reddit = praw.Reddit(
    client_id=creds['my_client_id'],
    client_secret=creds['my_client_secret'],
    user_agent=creds['my_user_agent'],
    username=creds['username'],
    password=creds['password']
)

# Check if the Reddit instance is read-only
print(reddit.read_only)

# Fetch and print interesting data about submissions
subreddit_name = 'learnpython'
subreddit = reddit.subreddit(subreddit_name)

submissions = subreddit.hot(limit=5)  # Limiting to 5 for demonstration

for submission in submissions:
    print(f"Title: {submission.title}")
    print(f"Author: {submission.author}")
    print(f"Score: {submission.score}")
    print(f"ID: {submission.id}")
    print(f"URL: {submission.url}")
    print(f"Created: {submission.created_utc}")
    print(f"Number of Comments: {submission.num_comments}")
    print(f"Subreddit: {submission.subreddit}")
    print(f"Flair: {submission.link_flair_text}")
    print(f"Upvote Ratio: {submission.upvote_ratio}")
    print(f"Is Stickied: {submission.stickied}")
    print(f"Is NSFW: {submission.over_18}")
    print("=" * 40)

True
Title: Ask Anything Monday - Weekly Thread
Author: AutoModerator
Score: 3
ID: 1dhkxms
URL: https://www.reddit.com/r/learnpython/comments/1dhkxms/ask_anything_monday_weekly_thread/
Created: 1718582429.0
Number of Comments: 49
Subreddit: learnpython
Flair: None
Upvote Ratio: 1.0
Is Stickied: True
Is NSFW: False
Title: Do Any Of You Feel The Fear Of forgetting when you're just starting out 
Author: Weekly_Event_1969
Score: 19
ID: 1dm2948
URL: https://www.reddit.com/r/learnpython/comments/1dm2948/do_any_of_you_feel_the_fear_of_forgetting_when/
Created: 1719081212.0
Number of Comments: 17
Subreddit: learnpython
Flair: None
Upvote Ratio: 0.84
Is Stickied: False
Is NSFW: False
Title: Learning Python
Author: Low_Mathematician571
Score: 2
ID: 1dmecpn
URL: https://www.reddit.com/r/learnpython/comments/1dmecpn/learning_python/
Created: 1719118121.0
Number of Comments: 1
Subreddit: learnpython
Flair: None
Upvote Ratio: 1.0
Is Stickied: False
Is NSFW: False
Title: How to locate web elements us
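
Since `reddit.read_only` reports whether the instance can write, it is worth checking before attempting to post. The sketch below assumes PRAW's `Subreddit.submit` method with its `title` and `selftext` arguments; `submit_text_post` is a hypothetical wrapper, and the subreddit name in the usage note is only an example.

```python
def submit_text_post(reddit, subreddit_name, title, body):
    """Submit a text post, refusing to try if the instance is read-only."""
    if reddit.read_only:
        raise PermissionError(
            "This Reddit instance is read-only; check that username and password were supplied."
        )
    # Passing selftext (rather than url) creates a text post
    return reddit.subreddit(subreddit_name).submit(title=title, selftext=body)
```

A call might look like `submit_text_post(reddit, 'test', 'Hello from PRAW', 'Posted from a notebook')`. Only post to a subreddit whose rules allow test posts.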

---

© 2024 Institute of Data