<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 3.2.2 
# *Mining Social Media on Reddit*

## The Reddit API and the PRAW Package

The Reddit API is rich and complex, with many endpoints (https://www.reddit.com/dev/api/). It includes methods for navigating its collections, which include various kinds of media as well as comments. Fortunately, the Python library PRAW (which means Python Reddit API Wrapper) reduces much of this complexity.

Reddit requires developers to create and authenticate an app before they can use the API, but the process is less onerous than on some other platforms, and there is no waiting period for approval of new developers.

### 1. Create a Reddit App

Go to https://www.reddit.com/prefs/apps and click "create an app".

Enter the following in the form:

- a name for your app
- select "script" radio button
- a description
- a redirect URI

(N.B. For pulling data into a data science experiment, a local port can be used for the redirect URI; try http://127.0.0.1:1410)


- click "create app"
- from the form that displays, copy the following to a local text file (or to this notebook):

  - name (the name you gave to your app)
  - redirect URI
  - personal use script (this is your OAuth 2 Client ID)
  - secret (this is your OAuth 2 Secret)

### 2. Register for API Access

- follow the link at https://www.reddit.com/wiki/api and read the terms of use for Reddit API access 
- fill in the form fields at the bottom 
  - make sure to enter your new OAuth Client ID where indicated
  - your use case could be something like "Training in API usage for data science projects"
  - your platform could be something like "Jupyter Notebooks / Python"
  
- click "SUBMIT"
 
- when asked for User-Agent, enter something that fits this pattern:
  `your_os-python:your_reddit_appname:v1.0 (by /u/your_reddit_username)`
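
As a minimal sketch of the pattern above, the User-Agent string can be assembled from its parts. The function name `build_user_agent` and all four example values are hypothetical placeholders, not real account details.

```python
def build_user_agent(platform, app_name, version, username):
    """Format a User-Agent string following the pattern shown above."""
    return f"{platform}-python:{app_name}:v{version} (by /u/{username})"


# All four values below are placeholders; substitute your own details.
example_agent = build_user_agent("windows", "my_lab_app", "1.0", "my_reddit_username")
print(example_agent)
```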

### 3. Load Python Libraries

In [1]:
import praw # means Python Reddit API Wrapper
import requests
import json
import pprint # pretty print - prints in a more humanly friendly way
from datetime import datetime, date, time

### 4. Authenticate from your Python script

You could assign your authentication details explicitly, as follows:

In [None]:
my_user_agent = '' # your user Agent string goes in here
my_client_id = '' # your Client ID string goes in here
my_client_secret = '' # your Secret string goes in here

A better way would be to store these details externally, so they are not displayed in the notebook:

- create a file called "auth_reddit.json" in your "notebooks" directory, and save your credentials there in JSON format:

`{   "my_client_id": "your Client ID string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;` "my_client_secret": "your Secret string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"my_user_agent": "your user Agent string goes in here"` <br>
`}`

Use the following code to load the credentials:  

In [None]:
pwd()  # make sure your working directory is where the file is

In [None]:
path_auth = 'auth_reddit.json'
with open(path_auth) as f:  # the file is closed as soon as it has been read
    auth = json.load(f)

In [None]:
# For debugging only (this prints your secrets, so clear the cell output before sharing):
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(auth)

In [3]:
my_user_agent = auth['my_user_agent']
my_client_id = auth['my_client_id']
my_client_secret = auth['my_client_secret']
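
A misspelled or missing key in the JSON file only shows up as a `KeyError` at the point of use. One possible way to catch this earlier is sketched below; the helper name `load_credentials` and the `REQUIRED_KEYS` tuple are assumptions made for illustration and are not part of PRAW.

```python
import json

# Keys expected in the credentials file described above.
REQUIRED_KEYS = ("my_client_id", "my_client_secret", "my_user_agent")


def load_credentials(path, required_keys=REQUIRED_KEYS):
    """Read a JSON credentials file and check that no expected key is missing or empty."""
    with open(path, encoding="utf-8") as f:
        auth = json.load(f)
    missing = [key for key in required_keys if not auth.get(key)]
    if missing:
        raise KeyError(f"Missing or empty credential(s) in {path}: {', '.join(missing)}")
    return auth
```

It would be called as `auth = load_credentials('auth_reddit.json')`, after which the three assignments above work unchanged.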

Security considerations: 
- this method only keeps your credentials hidden as long as nobody else gets access to the credentials file, or to a saved copy of this notebook that still contains the debugging output above
- if you wanted another user to have access to the executable notebook without divulging your credentials you should set up an OAuth 2.0 workflow to let them obtain and apply their own API tokens when using your app
- if you just want to share your analyses, you could use a separate script (which you don't share) to fetch the data and save it locally, then use a second notebook (with no API access) to load and analyse the locally stored data
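
The last option above (fetch in one script, analyse in another) could look roughly like the sketch below. The function names and the idea of storing records as a list of plain dictionaries are assumptions for illustration; any serialisable format would do.

```python
import json


def save_records(records, path):
    """Write a list of plain dictionaries to a local JSON file (run in the private script)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back (run in the shared notebook; no API credentials needed)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The shared notebook then only ever calls `load_records`, so it contains neither the credentials nor any code that needs them.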

### 5. Exploring the API

Here is how to connect to Reddit with read-only access:

In [4]:
reddit = praw.Reddit(client_id = my_client_id, 
                     client_secret = my_client_secret, 
                     user_agent = my_user_agent)

print('Read-only = ' + str(reddit.read_only))  # Output: True

Read-only = True


After running the next cell, type `subreddit.` in a new cell, put the cursor after the '.' and hit the [tab] key to see the available members and methods of the object:

In [5]:
# Connect to a subreddit
subreddit_name = 'funny'
subreddit = reddit.subreddit(subreddit_name)

In [6]:
# Retrieve comments from subreddits
comments = []
for comment in subreddit.comments(limit=1000):
    comments.append(comment)

In [7]:
# Print first 5 comments
for comment in comments[:5]:
    print(comment.body)

That shit will only last as long as his mom thinks it's cute.
Frankly? I'm impressed.
Whoa, somebody had hay for dinner ..
Funny, or you just being ignorant to how things work?

Why is this dogshit post on my home page, it only has 200 upvotes? Reddit really sucks these days.
Perfect if every attacker wants to grab your shirt and hold on for dear life


In [8]:
# Check subreddit name
reddit.subreddit(subreddit_name)

Subreddit(display_name='funny')

Consult the PRAW and Reddit API documentation. Print a few of the response members below:

PRAW documentation: https://praw.readthedocs.io/en/stable/

Reddit API documentation: https://www.reddit.com/dev/api/

Content in Reddit is grouped by topics called "subreddits". Content, called "submissions", is fetched by calling the `subreddit` method of the connection object (which is our `reddit` variable) with an argument that matches an actual topic. 

We also need to chain a further method call that selects a listing (a "subinstance"), such as one of the following:

- controversial
- gilded
- hot
- new
- rising
- top
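
Because each of these listings is an ordinary method on the subreddit object, the listing can be chosen by name at run time. The sketch below assumes a helper called `fetch_titles` (hypothetical, not part of PRAW) and leaves out `gilded`, which may also yield comments that have no `title`; that caveat is an assumption worth checking against the PRAW documentation.

```python
# Listings that yield submissions; 'gilded' is omitted (see the note above).
LISTINGS = ("controversial", "hot", "new", "rising", "top")


def fetch_titles(subreddit, listing="hot", limit=10):
    """Return submission titles from the named listing of a subreddit object."""
    if listing not in LISTINGS:
        raise ValueError(f"Unknown listing: {listing!r}")
    listing_method = getattr(subreddit, listing)  # e.g. subreddit.hot
    return [submission.title for submission in listing_method(limit=limit)]
```

With a live connection this might be used as `fetch_titles(reddit.subreddit('learnpython'), 'new', limit=5)`.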

One of the submission object's members is `title`. Fetch and print 10 submission titles from the 'learnpython' subreddit using one of the subinstances above:

In [9]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.title)

Ask Anything Monday - Weekly Thread
Best Python course with lots of exercises and quizzes?
How to write a line that allows a customer to enter a number between 1-100
What to do when the learning to code process starts to get boring and tiring ?
How to activate multiple devices with differing Mac addresses in one python code?
FastAPI and SwiftUI
Get elements of list in the form (i, j, k) separately, divided by comma
How do python "reads" new paragraf (enter)
I cannot import my module.
Best paid courses for Python?


Now retrieve 10 authors:

In [10]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.author)

AutoModerator
pixieshit
SithAbsolutes
Tech-HRT
Relevant-Arachnid402
Regular-Hospital-281
Char-car92
JofoBoss
That_guy3475
Formal-Sale-9818


Note that we obtained the titles and authors from separate API calls. Can we expect these to correspond to the same submissions? If not, how could we guarantee that they do?

- SN_note: Not necessarily. Each call requests the 'hot' listing afresh, and the ranking can change between the two requests, so the tenth title from one call need not belong to the tenth author from the other. Reading `title` and `author` from the same submission object inside a single loop guarantees that they correspond.

In [11]:
# submissions = reddit.subreddit('learnpython')

for submission in reddit.subreddit('learnpython').hot(limit=10):
    print("Title: ", submission.title)
    print("Author: ", submission.author)
    print("----------------------------------")

Title:  Ask Anything Monday - Weekly Thread
Author:  AutoModerator
----------------------------------
Title:  Best Python course with lots of exercises and quizzes?
Author:  pixieshit
----------------------------------
Title:  How to write a line that allows a customer to enter a number between 1-100
Author:  SithAbsolutes
----------------------------------
Title:  What to do when the learning to code process starts to get boring and tiring ?
Author:  Tech-HRT
----------------------------------
Title:  How to activate multiple devices with differing Mac addresses in one python code?
Author:  Relevant-Arachnid402
----------------------------------
Title:  FastAPI and SwiftUI
Author:  Regular-Hospital-281
----------------------------------
Title:  Get elements of list in the form (i, j, k) separately, divided by comma
Author:  Char-car92
----------------------------------
Title:  How do python "reads" new paragraf (enter)
Author:  JofoBoss
----------------------------------
Title:  I can

In [12]:
submissions=reddit.subreddit('learnpython')

Why doesn't the next cell produce output?
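- SN_answer: `submissions` was assigned the Subreddit object itself, not a listing of submissions. A Subreddit object is not iterable, so the loop raises a `TypeError` before anything is printed. Calling a listing method such as `hot()` or `new()` on the subreddit returns an iterable of submissions, and each submission's `comments` attribute can then be read.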

In [13]:
for submission in submissions:
    print(submission.comments)

TypeError: 'Subreddit' object is not iterable

Print two comments associated with each of these submissions:

In [14]:
# Fetch the first 2 submissions from the 'hot' listing of the 'learnpython' subreddit
submissions = reddit.subreddit('learnpython').hot(limit=2)

# Iterate through each fetched submission
for submission in submissions:

    # Drop "load more comments" placeholders, which have no body attribute
    submission.comments.replace_more(limit=0)

    # list() flattens the comment tree; slice it to keep only the first two comments
    for comment in submission.comments.list()[:2]:
        print('Comment:', comment.body)
        print('-----------------------------------------------------------------')

Comment: I would love to learn Python, and I'm a bit overwhelmed and the bevy of options online.  
Are any of you self-taught and where do you recommend starting as a beginner?
-----------------------------------------------------------------
Comment: Practicing codewars everybody for at least one hour but just feels like i’m not improving… :(
-----------------------------------------------------------------
Comment: https://programming-23.mooc.fi from the University of Helsinki

Honestly, far better for complete beginners than ATBS. ATBS is great for the later parts where the projects come, but the early parts are too short for complete beginners.

The MOOC I've linked above addresses exactly your "repeated exercises" statement as it is majorly based on having you program.
-----------------------------------------------------------------
Comment: harvard university's              
CS50’s Introduction to Programming with Python                     

https://cs50.harvard.edu/python/2022

Referring to the API documentation, explore the submissions object and print some interesting data:

In [15]:
# Fetch a list of the latest submissions from a specific subreddit
submissions = reddit.subreddit('learnpython').new(limit=5)  # Limiting to the latest 5 submissions

# Iterate through the submissions and print interesting data
for submission in submissions:
    print("Submission Title:", submission.title)  # Print the title of the submission
    print("Submission Author:", submission.author)  # Print the author of the submission
    print("Submission Score:", submission.score)  # Print the submission score (upvotes - downvotes)
    print("Number of Comments:", submission.num_comments)  # Print the number of comments on the submission
    print("URL:", submission.url)  # Print the URL of the submission if it's a link
    print("Self Text:\n", submission.selftext)  # Print the self text of the submission if it's a text post
    print("-------------------------------------------------------------")

Submission Title: How to activate multiple devices with differing Mac addresses in one python code?
Submission Author: Relevant-Arachnid402
Submission Score: 2
Number of Comments: 1
URL: https://www.reddit.com/r/learnpython/comments/16tplnv/how_to_activate_multiple_devices_with_differing/
Self Text:
 I am trying to make a code more efficient by making python code activate multiple devices at once. Currently, this code is being used to run 5 of the same devices, same instructions, but the script is being run separately for each device. The code starts out with system\_init() and is connected to a Mac address, but I wanted to know if it was possible to define the system\_init() function with these devices and their Mac addresses to run them all simultaneously? This code is currently being run in Fenics dolfin. I am unsure how the devices are connected to the computer system, I am pretty sure it is by some physical port and not bluetooth. An example of the code I am envisioning looks some
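
For later analysis it is often handier to keep these fields as plain dictionaries than to print them. The sketch below assumes a helper named `submission_to_record` (hypothetical) and additionally reads `created_utc`, a Unix timestamp attribute that is not used elsewhere in this lab.

```python
from datetime import datetime, timezone


def submission_to_record(submission):
    """Copy a few attributes of a submission-like object into a plain dict."""
    # created_utc is assumed to be seconds since the Unix epoch, in UTC.
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "title": submission.title,
        "author": str(submission.author),  # author object (or None) converted to text
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created_utc": created.isoformat(),
    }
```

A list of such records could be built with `[submission_to_record(s) for s in reddit.subreddit('learnpython').new(limit=5)]`.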

#### Posting to Reddit

To be able to post to your Reddit account (i.e. contribute submissions), you need to connect to the API with read/write privilege. This requires an *authorised instance*, which is obtained by including your Reddit user name and password in the connection request: 

In [16]:
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent',
                     username='my username',
                     password='my password')
print(reddit.read_only)  # Output: False

False


You could hide these last two credentials by adding them to your JSON file and then reading all five values at once.

In [None]:
path_auth = 'auth_reddit.json'
with open(path_auth) as f:  # the file is closed as soon as it has been read
    auth = json.load(f)
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(auth)

In [18]:
my_user_agent = auth['my_user_agent']
my_client_id = auth['my_client_id']
my_client_secret = auth['my_client_secret']
my_username = auth['my_username']
my_password = auth['my_password']

In [19]:
# Pass the variables loaded from the JSON file, not string literals of their names
reddit = praw.Reddit(client_id=my_client_id,
                     client_secret=my_client_secret,
                     user_agent=my_user_agent,
                     username=my_username,
                     password=my_password)

print('Read-only = ' + str(reddit.read_only))  # Output: False

Read-only = False
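
With an authorised instance, a text post might be submitted along the lines of the sketch below. The wrapper `post_text_submission` is hypothetical; it relies on the `submit` method of a PRAW subreddit object with its `title` and `selftext` parameters, and the choice of subreddit is left to the caller.

```python
def post_text_submission(reddit, subreddit_name, title, body):
    """Submit a text (self) post through an authorised connection object."""
    if reddit.read_only:
        # Posting needs the username/password connection shown above.
        raise RuntimeError("This connection is read-only; supply username and password.")
    return reddit.subreddit(subreddit_name).submit(title=title, selftext=body)
```

When experimenting, post only to a subreddit intended for testing (r/test is assumed here to be one such place) and check its rules first, for example `post_text_submission(reddit, 'test', 'Hello from PRAW', 'Posted from a notebook.')`.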




---

© 2023 Institute of Data

---



