# PyConversations: A Reddit-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with Reddit data.

The first step is to obtain some data. To do so, configure a personal application via your Reddit account's [App Preferences](https://www.reddit.com/prefs/apps) and set it up as a personal use script. See the _Getting Access_ portion of [this blog post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for additional instructions and illustrations.

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import RedditPost

## Data Sample

Before demonstrating how to use `pyconversations` for pre-processing and analysis, we need to obtain a data sample.
To do so, we'll use a package called [praw](https://github.com/praw-dev/praw).

In [2]:
import praw

In [3]:
# Private information
CLIENT_ID = 'K9tFd5p7hRQkDQ'  # this should be the 'personal use script' on your App Preferences
SECRET_TOKEN = 'iU6qnd5KmfackIGwv0-w6oKR_V6iyQ'  # this should be the 'secret' on your App Preferences
USER_AGENT = 'HSHBot/0.0.1'  # a custom name for your application, sent to Reddit in the User-Agent request header
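
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the values can be read from environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_SECRET_TOKEN`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of `praw` or `pyconversations`.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names below are hypothetical; pick whatever fits your setup.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_SECRET_TOKEN", "REDDIT_USER_AGENT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_SECRET_TOKEN"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Example with a stand-in mapping instead of the real environment
demo_env = {
    "REDDIT_CLIENT_ID": "my-client-id",
    "REDDIT_SECRET_TOKEN": "my-secret",
    "REDDIT_USER_AGENT": "MyBot/0.0.1",
}
creds = load_reddit_credentials(demo_env)
print(sorted(creds))
```

The returned keys match the keyword arguments that this notebook passes to `praw.Reddit`, so the dictionary could be unpacked as `praw.Reddit(**creds)`.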

In [4]:
# configure a read-only praw.Reddit instance
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=SECRET_TOKEN,
    user_agent=USER_AGENT,
)

reddit, reddit.read_only

Version 7.3.0 of praw is outdated. Version 7.4.0 was released Friday July 30, 2021.


(<praw.reddit.Reddit at 0x120bd1590>, True)

In [5]:
# obtain a subreddit of interest
sub_name = 'Drexel'
subreddit = reddit.subreddit(sub_name)

subreddit.title  # PRAW is lazy, so no request is made until we access an attribute

pprint(vars(subreddit))

{'_fetched': True,
 '_path': 'r/Drexel/',
 '_reddit': <praw.reddit.Reddit object at 0x120bd1590>,
 'accounts_active': 50,
 'accounts_active_is_fuzzed': False,
 'active_user_count': 50,
 'advertiser_category': 'College / University',
 'all_original_content': False,
 'allow_chat_post_creation': False,
 'allow_discovery': True,
 'allow_galleries': True,
 'allow_images': True,
 'allow_polls': True,
 'allow_predictions': False,
 'allow_predictions_tournament': False,
 'allow_videogifs': True,
 'allow_videos': True,
 'banner_background_color': '#ffffff',
 'banner_background_image': 'https://styles.redditmedia.com/t5_2qh6g/styles/bannerBackgroundImage_250phz39qiw31.jpg?width=4000&s=0738f848906e0a9602210260124eb92e697c33e0',
 'banner_img': '',
 'banner_size': None,
 'can_assign_link_flair': False,
 'can_assign_user_flair': True,
 'collapse_deleted_comments': False,
 'comment_score_hide_mins': 0,
 'community_icon': 'https://styles.redditmedia.com/t5_2qh6g/styles/communityIcon_ng3537akpiw31.png?

In [6]:
# get the top submission via 'hot'
top_submission = list(subreddit.hot(limit=1))[0]
top_submission.title

pprint(vars(top_submission))

{'_comments_by_id': {},
 '_fetched': False,
 '_reddit': <praw.reddit.Reddit object at 0x120bd1590>,
 'all_awardings': [],
 'allow_live_comments': False,
 'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'author': Redditor(name='AstroGnat'),
 'author_flair_background_color': None,
 'author_flair_css_class': 'textflair',
 'author_flair_richtext': [],
 'author_flair_template_id': '59c1edbc-5345-11e1-9edb-12313b08a511',
 'author_flair_text': 'Alumni | Digital Media',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'text',
 'author_fullname': 't2_3gjpr',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'banned_at_utc': None,
 'banned_by': None,
 'can_gild': False,
 'can_mod_post': False,
 'category': None,
 'clicked': False,
 'comment_limit': 2048,
 'comment_sort': 'confidence',
 'content_categories': None,
 'contest_mode': False,
 'created': 1626638983.0,
 'created_utc': 1626638983.0,
 'discussion_type': N

In [7]:
# get all the comments on this submission
top_submission.comments.replace_more(limit=None)  # expand any "load more comments" stubs first
all_comments = top_submission.comments.list()

print(len(all_comments))

all_comments[0].score
pprint(vars(all_comments[0]))

15
{'_fetched': True,
 '_reddit': <praw.reddit.Reddit object at 0x120bd1590>,
 '_replies': <praw.models.comment_forest.CommentForest object at 0x1239053d0>,
 '_submission': Submission(id='omya9w'),
 'all_awardings': [],
 'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'associated_award': None,
 'author': Redditor(name='Fun-Obligation5729'),
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_883n07bs',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'banned_at_utc': None,
 'banned_by': None,
 'body': 'I have a room available from September 1st- August 31st at 3406 race '
         'street. It’s 775$ . Let me know if anyone’s interested.',
 'body_html': '<div class="md"><p>I have a room available from September 1s

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [8]:
# construct a conversation container
conv = Conversation()

In [9]:
# parse our root submission
cons_params = {
    'lang_detect': True,  # whether to enable the language detection module on the text
    'uid': top_submission.id,  # the unique identifier of the post
    'author': top_submission.author.name if top_submission.author is not None else None,  # name of user
    'created_at': RedditPost.parse_datestr(top_submission.created),  # creation timestamp
    'text': (top_submission.title + '\n\n' + top_submission.selftext).strip()  # text of post
}
top_post = RedditPost(**cons_params)
top_post

RedditPost(Reddit::AstroGnat::2021-07-18 16:09:43::Sublet Thread - Fall Winter

Here's the thread for::tags=)
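
Reddit reports creation times as epoch seconds, which is what gets passed to `RedditPost.parse_datestr` above. The sketch below uses only the standard library to show how the `created_utc` value from the submission dump maps to a calendar time. The post representation above shows 16:09:43, four hours earlier than the UTC reading, which suggests the timestamp was rendered in the local time zone of the machine that ran the notebook. That reading is an assumption; we have not checked how `parse_datestr` treats time zones.

```python
from datetime import datetime, timezone

created = 1626638983.0  # the 'created_utc' value shown in the submission dump above

# Interpret the epoch seconds explicitly as UTC
as_utc = datetime.fromtimestamp(created, tz=timezone.utc)
print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-07-18 20:09:43

# Without a tz argument, the result depends on the machine's local time zone
as_local = datetime.fromtimestamp(created)
print(as_local.strftime("%Y-%m-%d %H:%M:%S"))
```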

In [10]:
# add our post to the conversation
conv.add_post(top_post)

In [11]:
# verify inclusion via the .posts property
conv.posts

{'omya9w': RedditPost(Reddit::AstroGnat::2021-07-18 16:09:43::Sublet Thread - Fall Winter
 
 Here's the thread for::tags=)}

In [12]:
# iterate through comments and add them to the conversation
for com in all_comments:
    conv.add_post(RedditPost(**{
        'lang_detect': True,  # whether to enable the language detection module on the text
        'uid': com.id,  # the unique identifier of the post
        'author': com.author.name if com.author is not None else None,  # name of user
        'created_at': RedditPost.parse_datestr(com.created),  # creation timestamp
        'text': com.body.strip(),  # text of post
        'reply_to': {com.parent_id.replace('t1_', '').replace('t3_', '')}  # set of IDs replied to
    }))

len(conv.posts)

16
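
The `reply_to` line above strips Reddit's type prefixes from `parent_id`: `t1_` marks a comment and `t3_` marks a submission. As a sketch, the same idea can be written as a small helper that only removes a leading prefix. The helper name `strip_fullname_prefix` and the comment ID `h5o2abc` are made up for illustration.

```python
def strip_fullname_prefix(fullname):
    """Drop a Reddit type prefix such as 't1_' (comment) or 't3_' (submission)."""
    prefix, sep, bare_id = fullname.partition("_")
    # Only strip when the text before the underscore looks like a type prefix
    if sep and len(prefix) == 2 and prefix[0] == "t" and prefix[1].isdigit():
        return bare_id
    return fullname


print(strip_fullname_prefix("t3_omya9w"))   # omya9w
print(strip_fullname_prefix("t1_h5o2abc"))  # h5o2abc
print(strip_fullname_prefix("omya9w"))      # omya9w
```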

### Sub-Conversation Segmentation

In [13]:
# separate disjoint conversations (a full query from the site will likely yield just one)
segs = conv.segment()

len(segs)

1
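
One way to think about segmentation is as finding connected components in the reply graph: posts that are linked by any chain of replies belong to the same sub-conversation. The sketch below illustrates that idea on plain dictionaries. It is an assumption about the general technique, not a description of how `Conversation.segment()` is implemented, and the function name `segment` here is our own.

```python
from collections import defaultdict


def segment(reply_to):
    """Group post IDs into connected components of the (undirected) reply graph.

    reply_to maps each post ID to the set of post IDs it replies to.
    """
    neighbors = defaultdict(set)
    for pid, parents in reply_to.items():
        neighbors[pid]  # make sure isolated posts appear as well
        for parent in parents:
            if parent in reply_to:  # ignore replies to posts we never collected
                neighbors[pid].add(parent)
                neighbors[parent].add(pid)

    seen, components = set(), []
    for start in reply_to:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        components.append(component)
    return components


toy = {
    "a": set(),        # a root post
    "b": {"a"},        # reply to a
    "c": {"b"},        # reply to b
    "x": {"missing"},  # reply whose parent was not collected
    "y": {"x"},        # reply to x
}
print(segment(toy))
```

In the toy data, the missing parent splits the posts into two groups, which is the situation the comment in the cell above alludes to: a complete pull of one thread should normally produce a single segment.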

### (Detected) Language Distribution

In [14]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 16})
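
When a thread mixes languages, shares are often easier to read than raw counts. A minimal sketch using a made-up list of detected language codes:

```python
from collections import Counter

langs = ["en", "en", "es", "en", "und"]  # made-up detected languages
counts = Counter(langs)
total = sum(counts.values())

# most_common() orders the languages from most to least frequent
shares = {lang: n / total for lang, n in counts.most_common()}
print(shares)  # {'en': 0.6, 'es': 0.2, 'und': 0.2}
```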

### Conversation-Level Redaction

`Conversation.redact()` produces a thread that is cleaned of user-specific information.
Redaction is scoped to the conversation: all usernames are first enumerated (from author names and, for Reddit and Twitter, from in-text mentions), and then user mentions and author names are replaced by `USER{\d}`, where `{\d}` is the integer assigned to that username during enumeration.

Here's a demonstration:

In [15]:
# pre-redaction 
names = {post.author for post in conv.posts.values()}

len(names), names

(16,
 {'According-Rate-2705',
  'AstroGnat',
  'Fit_Web_7741',
  'Fun-Obligation5729',
  'Hasta_La_Pasta827',
  'HumbleAbodee',
  'asapmeelz',
  'ghanshani_ritik',
  'makkirch',
  'memeboi2002',
  'sanjubee',
  'simonest27',
  'starryknight16',
  'thecalk',
  'turtlesturtlesduck',
  'viettran127'})

In [16]:
# redaction step (left commented out in this run, so the names printed below are unchanged)
# conv.redact()

In [17]:
# post-redaction
names = {post.author for post in conv.posts.values()}
len(names), names

(16,
 {'According-Rate-2705',
  'AstroGnat',
  'Fit_Web_7741',
  'Fun-Obligation5729',
  'Hasta_La_Pasta827',
  'HumbleAbodee',
  'asapmeelz',
  'ghanshani_ritik',
  'makkirch',
  'memeboi2002',
  'sanjubee',
  'simonest27',
  'starryknight16',
  'thecalk',
  'turtlesturtlesduck',
  'viettran127'})
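
The two stages described in this section, enumeration followed by replacement, can be sketched on plain dictionaries as follows. This is an illustration of the idea only: the post format, the mention pattern (`u/name` or `@name`), and the function name `redact` are assumptions, and the real `Conversation.redact()` may differ in all of them.

```python
import re


def redact(posts):
    """Replace author names and in-text mentions with USER<n> placeholders.

    posts is a list of dicts with 'author' and 'text' keys (a stand-in format).
    Mentions are assumed to look like 'u/name' or '@name'.
    """
    mention = re.compile(r"(?<!\w)(?:/?u/|@)([A-Za-z0-9_-]+)")

    # Stage 1: enumerate every username seen in the conversation
    ids = {}
    for post in posts:
        names = [post["author"]] + mention.findall(post["text"])
        for name in names:
            if name is not None and name not in ids:
                ids[name] = len(ids)

    # Stage 2: rewrite authors and mentions using the shared mapping
    def swap(match):
        return f"USER{ids[match.group(1)]}"

    return [
        {
            "author": None if p["author"] is None else f"USER{ids[p['author']]}",
            "text": mention.sub(swap, p["text"]),
        }
        for p in posts
    ]


thread = [
    {"author": "alice", "text": "Room available, ask u/bob for details."},
    {"author": "bob", "text": "Thanks @alice, happy to help!"},
]
for post in redact(thread):
    print(post)
```

Because the mapping is built once per conversation, the same user receives the same placeholder whether they appear as an author or as a mention.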

### Saving to and Loading from the Universal Format

In [18]:
import json

In [19]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
with open('reddit_conv.json', 'w') as fh:
    json.dump(conv.to_json(), fh)

In [20]:
# reloading directly from the JSON
with open('reddit_conv.json') as fh:
    conv_reloaded = Conversation.from_json(json.load(fh))
len(conv_reloaded.posts)

16
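
The JSON Lines alternative mentioned above stores one conversation per line, which makes it easy to append to a file or stream through a large collection. The sketch below shows the round trip with stand-in dictionaries. The helper names `save_jsonl` and `load_jsonl` and the record contents are hypothetical.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read a JSON Lines file back into a list."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Stand-in payloads; in the notebook these would come from conv.to_json()
records = [{"id": "conv-1", "posts": 16}, {"id": "conv-2", "posts": 3}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reddit_convs.jsonl")
    save_jsonl(records, path)
    reloaded = load_jsonl(path)

print(reloaded == records)  # True
```

With `pyconversations`, each record would be the result of `conv.to_json()`, and each reloaded item would be passed to `Conversation.from_json`, as in the cells above.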

### Feature Extraction

In [21]:
from pyconversations.feature_extraction import ConversationVectorizer
from pyconversations.feature_extraction import PostVectorizer
from pyconversations.feature_extraction import UserVectorizer

In [22]:
convs = True
# convs = False

users = True
# users = False

posts = True

normalization = 'standard'

cv = ConversationVectorizer(normalization=normalization, agg_user_fts=users, agg_post_fts=posts, include_source_user=True)
pv = PostVectorizer(normalization=normalization, include_conversation=convs, include_user=users)
uv = UserVectorizer(normalization=normalization, agg_conv_fts=convs, agg_post_fts=posts)

In [23]:
pv.fit(conv=conv_reloaded)
cv.fit(conv=conv_reloaded)
uv.fit(conv=conv_reloaded)

<pyconversations.feature_extraction.extractors.UserVectorizer at 0x1238fe110>

In [24]:
cxs = cv.transform(conv=conv_reloaded)
pxs = pv.transform(conv=conv_reloaded)
uxs = uv.transform(conv=conv_reloaded)

pprint(cxs.shape)
# pprint(cxs)

pprint(pxs.shape)
# pprint(pxs)

pprint(uxs.shape)

(1, 3225)
(16, 3317)
(16, 460)
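
The vectorizers above are configured with `normalization='standard'`. We assume this refers to standard (z-score) scaling, where each feature column is shifted to zero mean and divided by its standard deviation. That reading is not confirmed by this notebook. A minimal standard-library sketch of z-scoring, with a hypothetical helper `standardize`:

```python
from statistics import mean, pstdev


def standardize(columns):
    """Z-score each feature column: subtract the mean, divide by the std.

    Constant columns get a divisor of 1.0 to avoid dividing by zero.
    """
    scaled = []
    for col in columns:
        mu = mean(col)
        sigma = pstdev(col) or 1.0
        scaled.append([(v - mu) / sigma for v in col])
    return scaled


features = [
    [1.0, 2.0, 3.0],  # a feature that varies across posts
    [5.0, 5.0, 5.0],  # a constant feature
]
scaled = standardize(features)
print(scaled)
```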


In [26]:
from pyconversations.feature_extraction.post import get_all as post_all
from pyconversations.feature_extraction.post_in_conv import get_all as pic_all
from pyconversations.feature_extraction.conv import get_all as conv_all
from pyconversations.feature_extraction.user_in_conv import get_all as uic_all

In [27]:
ix = -1
pid = conv.time_order()[ix]
post = conv.posts[pid]

# post
post.text

'Hi all my name is Brianna and I secured the lease to this off campus property to move in for September and I am looking for another female roommate! The place is a 2 BDR 1BTH $1350 per month ($697.50 per person per month), the building is newly renovated and we would be the first people to live there! I am a 20 y/o female third year Finance major and I would love to have you as my roomie. Please PM me if you are interested.'

In [29]:
# static-post analysis (i.e., anything that can be extracted from a post in isolation)
x = post_all(post, ignore_keys={'type_frequency', 'tokens'})

print(len(x))

x

25


{'is_source': False,
 'author': 'Fit_Web_7741',
 'language': 'en',
 'platform': 'Reddit',
 'mixing_k1': 0.99001,
 'mixing_theta': 0.8516234384055628,
 'mixing_entropy': 0.7593498211510719,
 'mixing_N_avg': 27.02854747291602,
 'mixing_M_avg': 182.16183999999998,
 '?_count': 0,
 '!_count': 2,
 'char_count': 427,
 'emoji_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'degree_out': 1,
 'punct_count': 6,
 'token_count': 184,
 'type_count': 73,
 'uppercase_count': 18,
 'url_count': 0,
 'emojis': [],
 'hashtags': [],
 'mentions': [],
 'urls': []}
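
Several of the static features listed above are simple surface counts over the post text. The sketch below computes a few of them with plain string operations. The exact definitions used by `pyconversations` (for example, how URLs are detected or what counts as uppercase) may differ, and the function name `simple_counts` is our own.

```python
def simple_counts(text):
    """Compute a few surface-level counts for a single post's text."""
    return {
        "?_count": text.count("?"),
        "!_count": text.count("!"),
        "char_count": len(text),
        "uppercase_count": sum(ch.isupper() for ch in text),
        # crude URL detection: whitespace-separated tokens starting with a scheme
        "url_count": sum(
            tok.startswith(("http://", "https://")) for tok in text.split()
        ),
    }


sample = "Is the room still open? PM me! See https://example.com"
print(simple_counts(sample))
```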

In [30]:
# post within conversation, including static features
x = pic_all(post, conv, ignore_keys={'type_frequency', 'tokens'})

print(len(x))

x

98


{'is_leaf': True,
 'is_internal': False,
 'is_author_source_author': False,
 'is_source': False,
 'relative_age': 2216614.0,
 'response_time': 2216614.0,
 'avg_token_entropy_after-ancestors': 0.45328454299820103,
 'avg_token_entropy_after-before': 0.40250145750045696,
 'avg_token_entropy_after-children': 0.4569831546342704,
 'avg_token_entropy_after-descendants': 0.4569831546342704,
 'avg_token_entropy_after-full': 0.40250145750045696,
 'avg_token_entropy_after-parents': 0.45328454299820103,
 'avg_token_entropy_after-siblings': 0.403718282170484,
 'avg_token_entropy_ancestors-after': 0.4211230570003033,
 'avg_token_entropy_ancestors-before': 0.3829224458494741,
 'avg_token_entropy_ancestors-children': 0.4211230570003033,
 'avg_token_entropy_ancestors-descendants': 0.4211230570003033,
 'avg_token_entropy_ancestors-full': 0.3829224458494741,
 'avg_token_entropy_ancestors-parents': 0.4211230570003033,
 'avg_token_entropy_ancestors-siblings': 0.3829224458494741,
 'avg_token_entropy_before-
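
Features such as `is_leaf`, `is_internal`, `relative_age`, and `response_time` depend on where a post sits in the reply tree. The sketch below derives them from two plain mappings. The definitions are assumptions chosen to be consistent with the output above (a post that replies directly to the source has equal `relative_age` and `response_time`), and they may not match the library's in edge cases such as a source post with no replies.

```python
def position_features(pid, reply_to, created_at):
    """Describe where a post sits in the reply tree.

    reply_to maps post ID -> set of parent IDs;
    created_at maps post ID -> creation time in epoch seconds.
    """
    has_parent = any(parent in reply_to for parent in reply_to[pid])
    has_child = any(pid in parents for parents in reply_to.values())
    parent_times = [created_at[p] for p in reply_to[pid] if p in created_at]
    return {
        "is_source": not has_parent,
        "is_leaf": has_parent and not has_child,
        "is_internal": has_parent and has_child,
        # seconds since the earliest post in the conversation
        "relative_age": created_at[pid] - min(created_at.values()),
        # seconds since the most recent parent; None for a source post
        "response_time": (
            created_at[pid] - max(parent_times) if parent_times else None
        ),
    }


reply_to = {"root": set(), "a": {"root"}, "b": {"a"}}
created_at = {"root": 0.0, "a": 60.0, "b": 150.0}

for pid in reply_to:
    print(pid, position_features(pid, reply_to, created_at))
```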

In [31]:
# user-level analysis within a single conversation
x = uic_all(post.author, conv)

print(len(x))

x

461


{'is_source_author': False,
 'mixing_k1': 0.99001,
 'mixing_theta': 0.8516234384055628,
 'mixing_entropy': 0.7593498211510719,
 'mixing_N_avg': 27.02854747291602,
 'mixing_M_avg': 182.16183999999998,
 'avg_token_entropy': 0.40250145750045696,
 'post_min_relative_age': 2216614.0,
 'post_max_relative_age': 2216614.0,
 'post_mean_relative_age': 2216614.0,
 'post_median_relative_age': 2216614.0,
 'post_std_relative_age': 1.0,
 'post_min_response_time': 2216614.0,
 'post_max_response_time': 2216614.0,
 'post_mean_response_time': 2216614.0,
 'post_median_response_time': 2216614.0,
 'post_std_response_time': 1.0,
 'post_min_avg_token_entropy_after-ancestors': 0.45328454299820103,
 'post_max_avg_token_entropy_after-ancestors': 0.45328454299820103,
 'post_mean_avg_token_entropy_after-ancestors': 0.45328454299820103,
 'post_median_avg_token_entropy_after-ancestors': 0.45328454299820103,
 'post_std_avg_token_entropy_after-ancestors': 1.0,
 'post_min_avg_token_entropy_after-before': 0.402501457500
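
The user-level keys above follow a `post_<stat>_<feature>` pattern, where each per-post feature is summarized by its minimum, maximum, mean, median, and standard deviation over the user's posts. A sketch of that aggregation, with a hypothetical helper `aggregate`, is shown here. Note that the output above reports a standard deviation of 1.0 where the minimum, maximum, and mean coincide, so the library's handling of that case is evidently different from `pstdev`, which would return 0.0.

```python
from statistics import mean, median, pstdev


def aggregate(prefix, name, values):
    """Summarize a list of per-post values under '<prefix>_<stat>_<name>' keys."""
    stats = {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),
    }
    return {f"{prefix}_{stat}_{name}": value for stat, value in stats.items()}


# response times (in seconds) for three hypothetical posts by the same user
summary = aggregate("post", "response_time", [30.0, 60.0, 90.0])
print(summary)
```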

In [35]:
# full conversational analysis 
x = conv_all(conv, ignore_keys={'type_frequency_distribution'})

print(len(x))

print(sorted(x.keys()))

# x

2770
['!_count', '?_count', 'author_source_author_count', 'char_count', 'degree', 'degree_in', 'degree_in_size_distribution', 'degree_out', 'degree_out_size_distribution', 'degree_size_distribution', 'density', 'depth_distribution', 'duration', 'emoji_count', 'hashtag_count', 'internal_count', 'leaf_count', 'mention_count', 'messages', 'mixing_M_avg', 'mixing_N_avg', 'mixing_entropy', 'mixing_k1', 'mixing_theta', 'post_max_!_count', 'post_max_?_count', 'post_max_avg_token_entropy_after-ancestors', 'post_max_avg_token_entropy_after-before', 'post_max_avg_token_entropy_after-children', 'post_max_avg_token_entropy_after-descendants', 'post_max_avg_token_entropy_after-full', 'post_max_avg_token_entropy_after-parents', 'post_max_avg_token_entropy_after-siblings', 'post_max_avg_token_entropy_ancestors-after', 'post_max_avg_token_entropy_ancestors-before', 'post_max_avg_token_entropy_ancestors-children', 'post_max_avg_token_entropy_ancestors-descendants', 'post_max_avg_token_entropy_ancestors