
PRAW-CoDiaLS

A niche CLI tool built using the Python Reddit API Wrapper (PRAW) for Community & Domain-Targeted Link Scraping.

Written for Python 3 (3.6 or later is required due to liberal use of f-strings). Third-party modules needed: praw, pyaml, and pandas.

Installation

PRAW-CoDiaLS is available either from this repository or via PyPI:

Recommended:

$ pip install praw-codials

Alternatives:

Download the .whl file or .tar.gz file and then run the appropriate command.

$ pip install praw_codials-1.0.3-py3-none-any.whl -r requirements.txt
$ pip install praw-codials-1.0.3.tar.gz -r requirements.txt

You can also build a wheel locally from source to incorporate new changes.

$ python setup.py bdist_wheel

Usage

Valid Reddit OAuth credentials are required for usage. See Reddit's guide on how to obtain and set them up. In short, you will need to provide a client_id, client_secret, username, password, and user_agent.
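These are the same values PRAW itself expects. A minimal sketch of how they map onto a PRAW session (placeholder values shown for illustration only):

    import praw

    # Placeholder credentials; substitute your own OAuth values.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        username="YOUR_USERNAME",
        password="YOUR_PASSWORD",
        user_agent="praw-codials (by u/YOUR_USERNAME)",
    )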

usage: praw-codials [-h] -s list,of,subs -d list,of,domains
                    -o client_id,client_secret,password,username,user_agent
                    [-p /path/to/save/output/] [-l LIMIT] [--new]
                    [--controversial TIMEFRAME] [--hot] [--top TIMEFRAME]
                    [--quiet] [--nocomments] [--regex REGEX]

Python Reddit API Wrapper (PRAW) for Community & Domain-Targeted Link Scraping.

  -h, --help            show this help message and exit
  -s SUBS, --subs SUBS  Subreddit(s) to target. (Comma-separate multiples)
  -d DOMAINS, --domains DOMAINS
                        Domain(s) to collect URLs from. (Comma-separate multiples)

  -o OAUTH, --oauth OAUTH
                        OAuth information, either comma separated values in order (client_id, client_secret, password, username, 
                        user_agent) or a path to a key/value file in YAML format.
 
  -p PATH, --path PATH  Path to save output files (Posts_[DATETIME].csv and Comments_[DATETIME].csv). Default: working directory.
  -l LIMIT, --limit LIMIT
                        Maximum threads to check (cannot exceed 1000).
  -t TOP, --top TOP     Search top threads. Specify the timeframe to consider (hour, day, week, month, year, all)
  -c CONTROVERSIAL, --controversial CONTROVERSIAL
                        Search controversial threads. Specify the timeframe to consider (hour, day, week, month, year, all)
  --hot                 Search hot posts.
  -n, --new             Search new posts.
  -q, --quiet           Suppress progress reports until jobs are complete.
  -x, --nocomments      Don't collect links in top-level comments. Reduces performance limitations caused by the Reddit API rate limit.
  --regex REGEX         Override automatically generated regular expressions. NOTE: Assumes escape characters are provided in such a way that the shell
                        passes a properly escaped literal string to Python.
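For illustration, a hypothetical invocation targeting two subreddits and one domain, authenticating with a YAML credentials file, might look like this (note that the default regex described below appends .com to each domain):

$ praw-codials -s news,worldnews -d imgur -o oauth.yaml -l 500 --hot -t all -q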

By default, regular expressions will be generated for each provided domain in the form "{PREFIX}{DOMAIN}{SUFFIX}" where:

  • PREFIX = (?:https?://)?(?:www.)?
  • SUFFIX = .com/?[^\s\)]*
  • DOMAIN is the original domain with all periods escaped

A check is also performed to make sure that the substring '](' is not included, which removes matches that span the two halves of a Markdown link. In these cases, only the right half of the link is collected.
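A rough sketch of how the default pattern and the Markdown-link check described above could be built (illustrative only; the tool's actual code may differ):

    import re

    PREFIX = r"(?:https?://)?(?:www.)?"
    SUFFIX = r".com/?[^\s\)]*"

    def default_pattern(domain):
        # Escape the periods in the domain, then wrap it with the default prefix/suffix.
        escaped = domain.replace(".", r"\.")
        return re.compile(f"{PREFIX}{escaped}{SUFFIX}")

    def keep_right_half(match_text):
        # If a match spans a Markdown link ("[text](url)"), keep only the right half.
        return match_text.split("](", 1)[-1]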

Implementation Details

By default, this tool will return URLs collected from both link submissions (the main post of each thread) and the top-level comments of both text and link submissions (self/link posts), but not their children. This can be disabled at the command line with the --nocomments flag (see above). In a future update, I plan to provide an argument for setting a comment recursion depth; however, any such feature will drastically impact performance due to the Reddit API rate limit.

On that note, please be aware that Reddit enforces rate limits, which means this script will likely check between 80-100 pieces of content per minute. To improve performance, the script opens multiple PRAW instances and uses Python's threading module for a small performance boost. In my limited testing, this improved throughput by approximately 33%, from ~65 posts/min to ~85 posts/min, when enabling all subreddit search methods (hot/top (all)/new/controversial (all)) with the default post limit (1000) across two subreddits and two domains. This amounts to checking approximately 8K posts and tens of thousands of comments.
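A simplified sketch of that approach, assuming one worker (and one PRAW instance) per subreddit; the names and credential handling here are illustrative, not the tool's actual internals:

    from concurrent.futures import ThreadPoolExecutor
    import praw

    CREDENTIALS = {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "username": "YOUR_USERNAME",
        "password": "YOUR_PASSWORD",
        "user_agent": "praw-codials (by u/YOUR_USERNAME)",
    }

    def collect_urls(subreddit_name, limit=1000):
        # Each worker builds its own PRAW instance so requests can run concurrently.
        reddit = praw.Reddit(**CREDENTIALS)
        return [post.url for post in reddit.subreddit(subreddit_name).hot(limit=limit)]

    # One thread per subreddit; overall throughput is still bounded by Reddit's rate limit.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(collect_urls, ["news", "worldnews"]))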

To further limit requests, the script minimizes the number of comments it might access twice (e.g., a thread that appears in both Top and Hot) by storing the submission and comment IDs it has already encountered.
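In sketch form (an assumed structure, not the actual implementation):

    # Track IDs already processed so overlapping listings (e.g. Hot and Top) are not re-checked.
    seen_submission_ids = set()

    def process_submission(submission):
        if submission.id in seen_submission_ids:
            return  # already handled via another listing
        seen_submission_ids.add(submission.id)
        # ... collect URLs from the post and its top-level comments ...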

Output reports the following statistics as columns of two separate multi-row CSV files, one for submissions and one for comments (if included); a rough sketch of the output step follows the list:

  • Submissions: post author, post ID, title, url, subreddit, score, upvote ratio (note: these are approximate/obfuscated), and post flair
  • Comments: comment author, comment ID, body (including Markdown), subreddit, score, all of the above attributes as they pertain to the comment's parent submission/thread, and URLs obtained by simple RegEx (multiple entries/rows are generated if multiple links matching the target domain(s) are found in the text body)
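As an illustration of that output step (the column names and timestamp format shown are assumptions, not necessarily what the tool writes):

    from datetime import datetime
    import pandas as pd

    # Hypothetical submission rows; the real tool builds these from PRAW objects.
    posts = pd.DataFrame([{
        "author": "some_user",
        "id": "abc123",
        "title": "Example post",
        "url": "https://example.com/page",
        "subreddit": "news",
        "score": 42,
        "upvote_ratio": 0.93,
        "flair": None,
    }])

    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # assumed timestamp format
    posts.to_csv(f"Posts_{timestamp}.csv", index=False)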

If you think that I've missed an important attribute, please let me know!

License

PRAW-CoDiaLS is released under the MIT License. See LICENSE for details.

Contact

To report issues or contribute to this project, please contact me on the GitHub repo for this project.
