# PyConversations: A Reddit-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with Reddit data.

The first step is to obtain some data. To do so, configure a personal application via your Reddit account's [App Preferences](https://www.reddit.com/prefs/apps), choosing the "script" type for personal use. See the _Getting Access_ portion of [this blog post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for additional instructions and visuals.

In [14]:
import pprint

from pyconversations.convo import Conversation
from pyconversations.message import RedditPost

## Data Sample

Before demonstrating how we can use `pyconversations` for pre-processing and analysis, we need to obtain a data sample.
To do so, we'll use a package called [praw](https://github.com/praw-dev/praw).

In [1]:
import praw

In [2]:
# Private information
CLIENT_ID = ''  # the ID listed under 'personal use script' in your App Preferences
SECRET_TOKEN = ''  # the 'secret' listed in your App Preferences
USER_AGENT = ''  # a custom name for your application, sent as the User-Agent request header; gives Reddit a brief description of the app

In [4]:
# configure a read-only praw.Reddit instance
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=SECRET_TOKEN,
    user_agent=USER_AGENT,
)

reddit, reddit.read_only

(<praw.reddit.Reddit at 0x1048c5a10>, True)

In [10]:
# obtain a sub-reddit of interest
sub_name = 'Drexel'
subreddit = reddit.subreddit(sub_name)
subreddit.title  # PRAW is lazy, so no request is made until an attribute is accessed
pprint.pprint(vars(subreddit))

{'_fetched': True,
 '_path': 'r/Drexel/',
 '_reddit': <praw.reddit.Reddit object at 0x1048c5a10>,
 'accounts_active': 73,
 'accounts_active_is_fuzzed': False,
 'active_user_count': 73,
 'advertiser_category': 'College / University',
 'all_original_content': False,
 'allow_chat_post_creation': False,
 'allow_discovery': True,
 'allow_galleries': True,
 'allow_images': True,
 'allow_polls': True,
 'allow_predictions': False,
 'allow_predictions_tournament': False,
 'allow_videogifs': True,
 'allow_videos': True,
 'banner_background_color': '#ffffff',
 'banner_background_image': 'https://styles.redditmedia.com/t5_2qh6g/styles/bannerBackgroundImage_250phz39qiw31.jpg?width=4000&s=0738f848906e0a9602210260124eb92e697c33e0',
 'banner_img': '',
 'banner_size': None,
 'can_assign_link_flair': False,
 'can_assign_user_flair': True,
 'collapse_deleted_comments': False,
 'comment_score_hide_mins': 0,
 'community_icon': 'https://styles.redditmedia.com/t5_2qh6g/styles/communityIcon_ng3537akpiw31.png?

In [11]:
# get the top submission via 'hot'
top_submission = list(subreddit.hot(limit=1))[0]
top_submission.title

pprint.pprint(vars(top_submission))

{'_comments_by_id': {},
 '_fetched': False,
 '_reddit': <praw.reddit.Reddit object at 0x1048c5a10>,
 'all_awardings': [{'award_sub_type': 'GLOBAL',
                    'award_type': 'global',
                    'awardings_required_to_grant_benefits': None,
                    'coin_price': 80,
                    'coin_reward': 0,
                    'count': 1,
                    'days_of_drip_extension': 0,
                    'days_of_premium': 0,
                    'description': 'Everything is better with a good hug',
                    'end_date': None,
                    'giver_coin_reward': 0,
                    'icon_format': 'PNG',
                    'icon_height': 2048,
                    'icon_url': 'https://i.redd.it/award_images/t5_q0gj4/ks45ij6w05f61_oldHugz.png',
                    'icon_width': 2048,
                    'id': 'award_8352bdff-3e03-4189-8a08-82501dd8f835',
                    'is_enabled': True,
                    'is_new': False,
             

In [12]:
# get all the comments on this submission
top_submission.comments.replace_more(limit=0)  # drop unexpanded "load more comments" placeholders
all_comments = top_submission.comments.list()

print(len(all_comments))

all_comments[0].score
pprint.pprint(vars(all_comments[0]))

53
{'_fetched': True,
 '_reddit': <praw.reddit.Reddit object at 0x1048c5a10>,
 '_replies': <praw.models.comment_forest.CommentForest object at 0x10493e910>,
 '_submission': Submission(id='ky1j02'),
 'all_awardings': [],
 'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'associated_award': None,
 'author': None,
 'author_flair_background_color': '',
 'author_flair_css_class': None,
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': 'dark',
 'awarders': [],
 'banned_at_utc': None,
 'banned_by': None,
 'body': '[deleted]',
 'body_html': '<div class="md"><p>[deleted]</p>\n</div>',
 'can_gild': True,
 'can_mod_post': False,
 'collapsed': False,
 'collapsed_because_crowd_control': None,
 'collapsed_reason': None,
 'comment_type': None,
 'controversiality': 0,
 'created': 1612919594.0,
 'created_utc': 1612890794.0,
 'depth': 0,
 'distinguished': None,
 'downs': 0,
 'edited': False,
 'gilded': 0,
 'gildings': {},
 'id': 'gmpta7v',
 'is

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [22]:
# construct a conversation container
conv = Conversation()

In [23]:
# parse our root submission
cons_params = {
    'platform': 'Reddit',  # what platform this data is from
    'lang_detect': True,  # whether to enable the language detection module on the text
    'uid': top_submission.id,  # the unique identifier of the post
    'author': top_submission.author.name if top_submission.author is not None else None,  # name of user
    'created_at': RedditPost.parse_datestr(top_submission.created),  # creation timestamp
    'text': (top_submission.title + ': ' + top_submission.selftext).strip()  # text of post
}
top_post = RedditPost(**cons_params)
top_post

UniMessage(Reddit::AstroGnat::2021-01-15 22:13:56::Sublet Thread - Spring Summer: Here's the thread f::tags=)

In [24]:
# add our post to the conversation
conv.add_post(top_post)

In [25]:
# verify inclusion via the .posts property
conv.posts

{'ky1j02': UniMessage(Reddit::AstroGnat::2021-01-15 22:13:56::Sublet Thread - Spring Summer: Here's the thread f::tags=)}

In [27]:
# iterate through comments and add them to the conversation
for com in all_comments:
    conv.add_post(RedditPost(**{
        'platform': 'Reddit',  # what platform this data is from
        'lang_detect': True,  # whether to enable the language detection module on the text
        'uid': com.id,  # the unique identifier of the post
        'author': com.author.name if com.author is not None else None,  # name of user
        'created_at': RedditPost.parse_datestr(com.created),  # creation timestamp
        'text': com.body.strip(),  # text of post
        'reply_to': {com.parent_id.replace('t1_', '').replace('t3_', '')}  # set of IDs replied to
    }))

conv.messages

54
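
The `reply_to` value above is built by stripping Reddit's type prefix (`t1_` for comments, `t3_` for submissions) from `parent_id`. As a minimal sketch, the same step can be isolated in a small helper that only removes a leading prefix. The name `strip_fullname_prefix` is hypothetical and is not part of `pyconversations` or PRAW.

```python
import re

# Hypothetical helper (not part of pyconversations or PRAW).
# Reddit "fullnames" look like '<type>_<id>', e.g. 't1_gmpta7v' or 't3_ky1j02'.
_FULLNAME_PREFIX = re.compile(r'^t\d+_')


def strip_fullname_prefix(fullname):
    """Return the bare ID, dropping a leading type prefix if one is present."""
    return _FULLNAME_PREFIX.sub('', fullname, count=1)


print(strip_fullname_prefix('t3_ky1j02'))   # ky1j02
print(strip_fullname_prefix('t1_gmpta7v'))  # gmpta7v
print(strip_fullname_prefix('ky1j02'))      # ky1j02 (already bare)
```

Anchoring the pattern at the start of the string means an ID is never altered anywhere other than its prefix.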

### Sub-Conversation Segmentation

In [28]:
# separate disjoint conversations (a full thread pulled from the site is likely to be just one)
segs = conv.segment()

len(segs)

1
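
One way to think about segmentation is as finding connected components in the reply graph. The sketch below illustrates that idea on plain dictionaries. The function name `segment_ids` and the toy IDs are illustrative assumptions, and `Conversation.segment()` may well be implemented differently.

```python
from collections import defaultdict


def segment_ids(reply_to):
    """Group post IDs into disjoint threads.

    `reply_to` maps each post ID to the set of IDs it replies to.
    Returns a list of sets, one per connected component.
    """
    # Build an undirected adjacency map so a reply links both posts.
    neighbors = defaultdict(set)
    for uid, parents in reply_to.items():
        neighbors[uid].update(parents)
        for parent in parents:
            neighbors[parent].add(uid)
    neighbors = dict(neighbors)

    seen, segments = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        segments.append(component)
    return segments


# Toy data: two unrelated threads, rooted at 'a' and 'x'.
toy = {'a': set(), 'b': {'a'}, 'c': {'b'}, 'x': set(), 'y': {'x'}}
toy_segments = segment_ids(toy)
print(len(toy_segments))  # 2
```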

### (Detected) Language Distribution

In [29]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 51, 'und': 3})
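
Raw counts are easier to compare across conversations as proportions. A minimal sketch, using the counts copied from the output above (`und` presumably marks posts whose language could not be determined, which is an assumption here):

```python
from collections import Counter

# Counts copied from the output above.
lang_dist = Counter({'en': 51, 'und': 3})

total = sum(lang_dist.values())
# most_common() orders languages from most to least frequent.
lang_share = {lang: count / total for lang, count in lang_dist.most_common()}

for lang, share in lang_share.items():
    print(f'{lang}: {share:.1%}')
```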

### Conversation-Level Redaction

Using `Conversation.redact()` produces a thread that is cleaned of user-specific information.
Redaction is scoped to the conversation: all usernames are first enumerated (from author names and, for Reddit and Twitter, from in-text references), and then user mentions and author names are replaced by `USER{\d}`, where `{\d}` is the integer assigned to that username during enumeration.

Here's a demonstration:

In [30]:
# pre-redaction 
names = {post.author for post in conv.posts.values()}
len(names), names

(45,
 {'ADecentURL',
  'Active-Recipe-1681',
  'AstroGnat',
  'Coolkidinvestor',
  'Dysepher',
  'Environmental-Ant154',
  'Icy-Enthusiasm6983',
  'Jodinajo',
  'Mars_Tsar',
  'MrDrTurtl3475',
  'MustacheCash_Stash',
  'No-Mathematician4625',
  None,
  'RafsVeryOwn',
  'Revolutionary-Cry314',
  'Sad-Lampshade',
  'Spartacous1991',
  'SupermarketHour944',
  'Theregoesbird',
  'Tricky-Bet7267',
  'ZydecoVivo',
  'aant1100',
  'ambererich',
  'androserea',
  'anurag19031998',
  'ariella2020',
  'bored-af-nerd',
  'caaaaaaaaaaaaaaaarl',
  'cantrelate123456',
  'cherryblossoms321',
  'dsbuddy',
  'emoney1920',
  'fluffybeluga10',
  'gorillagripgirl',
  'halfbakedrealness',
  'jserrr',
  'jule11a',
  'machinaOverlord',
  'mcat-2105',
  'nathaling',
  'rugrus_andrew',
  'ruhanikenguru',
  'somnus12345',
  'ssjmac',
  'vyha1821'})

In [31]:
# redaction step
conv.redact()

In [32]:
# post-redaction
names = {post.author for post in conv.posts.values()}
len(names), names

(45,
 {None,
  'USER0',
  'USER1',
  'USER10',
  'USER11',
  'USER12',
  'USER13',
  'USER14',
  'USER15',
  'USER16',
  'USER17',
  'USER18',
  'USER19',
  'USER2',
  'USER20',
  'USER21',
  'USER22',
  'USER23',
  'USER24',
  'USER25',
  'USER26',
  'USER27',
  'USER28',
  'USER29',
  'USER3',
  'USER30',
  'USER31',
  'USER32',
  'USER33',
  'USER34',
  'USER35',
  'USER36',
  'USER37',
  'USER38',
  'USER39',
  'USER4',
  'USER40',
  'USER41',
  'USER42',
  'USER43',
  'USER5',
  'USER6',
  'USER7',
  'USER8',
  'USER9'})
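
The two-stage process described in this section (enumerate usernames, then substitute) can be approximated in plain Python. This is only a sketch under stated assumptions: the mention pattern, the function name `redact_thread`, and the toy usernames are all hypothetical, and the numbering used by `Conversation.redact()` may differ.

```python
import re

# Assumed pattern for Reddit-style mentions such as 'u/name' or '/u/name'.
MENTION = re.compile(r'(?<![\w/])/?u/([A-Za-z0-9_-]+)')


def redact_thread(posts):
    """Return a redacted copy of `posts`, a list of {'author', 'text'} dicts."""
    # 1) Enumerate every username in the thread, in first-seen order.
    user_ids = {}
    for post in posts:
        names = [post['author']] + MENTION.findall(post['text'])
        for name in names:
            if name is not None and name not in user_ids:
                user_ids[name] = len(user_ids)

    # 2) Replace author names and in-text mentions with USER<n>.
    def _sub(match):
        return f'USER{user_ids[match.group(1)]}'

    redacted = []
    for post in posts:
        author = post['author']
        redacted.append({
            'author': None if author is None else f'USER{user_ids[author]}',
            'text': MENTION.sub(_sub, post['text']),
        })
    return redacted


toy_posts = [
    {'author': 'alice', 'text': 'Anyone subletting? u/bob might know.'},
    {'author': 'bob', 'text': 'Yes, DM me.'},
    {'author': None, 'text': '[deleted]'},
]
toy_redacted = redact_thread(toy_posts)
print(toy_redacted[0])
```

Because the mapping is built once per thread, the same user receives the same placeholder in every post, which keeps the reply structure interpretable after redaction.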

### Saving and Loading from the universal format

In [34]:
import json

In [35]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
with open('reddit_conv.json', 'w') as fp:
    json.dump(conv.to_json(), fp)

In [37]:
# reloading directly from the JSON
with open('reddit_conv.json') as fp:
    conv_reloaded = Conversation.from_json(json.load(fp), RedditPost)
conv_reloaded.messages

54
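
The alternative mentioned in the saving cell, one conversation per line, might look like the following sketch. The dictionaries are stand-ins for the output of `conv.to_json()`, whose real structure may differ, and the file name is arbitrary.

```python
import json
import os
import tempfile

# Stand-ins for the output of `conv.to_json()`; the real structure may differ.
conversations = [
    {'id': 'convo-a', 'posts': ['first', 'second']},
    {'id': 'convo-b', 'posts': ['only']},
]

path = os.path.join(tempfile.mkdtemp(), 'reddit_convs.jsonl')

# Write one JSON document per line.
with open(path, 'w', encoding='utf-8') as fh:
    for convo in conversations:
        fh.write(json.dumps(convo) + '\n')

# Read the file back, one conversation per line, skipping blank lines.
with open(path, encoding='utf-8') as fh:
    reloaded = [json.loads(line) for line in fh if line.strip()]

print(len(reloaded))  # 2
```

With real data, each loaded dictionary would then be passed to `Conversation.from_json` together with `RedditPost`, as in the cell above.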

### Properties for Analysis

In [38]:
conv.messages  # number of messages in the conversation

54

In [39]:
conv.connections  # number of reply connections

53

In [40]:
conv.users  # number of unique users

45

In [41]:
conv.chars  # character size of the entire conversation

15718

In [44]:
conv.tokens  # length of conversation, in tokens

6194

In [45]:
conv.token_types  # number of unique tokens used in convo

853
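
Token and type counts combine into a type-token ratio, a rough measure of lexical variety. The sketch below assumes simple lowercase whitespace tokenization, which may not match the tokenizer `pyconversations` uses. The function name `token_stats` is hypothetical.

```python
def token_stats(texts):
    """Return (number of tokens, number of unique tokens) for a list of strings."""
    # Assumption: lowercase whitespace tokenization, for illustration only.
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(tokens), len(set(tokens))


n_tokens, n_types = token_stats(['the cat sat', 'The dog sat down'])
print(n_tokens, n_types)  # 7 5

# Type-token ratio for this notebook's conversation, from the values above.
ttr = 853 / 6194
print(f'{ttr:.3f}')  # 0.138
```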

In [46]:
conv.sources  # IDs of source messages (messages without a reply action, messages that originate dialog)

{'ky1j02'}

In [47]:
conv.density  # density of the conversation, when represented as a graph

0.037037037037037035
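
The reported value is consistent with the standard density formula for an undirected graph, 2m / (n(n - 1)), applied to the 54 messages and 53 connections shown above. That the library computes density exactly this way is an assumption; the sketch below only shows that the numbers agree.

```python
def undirected_density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


density = undirected_density(54, 53)
print(density)
```

For any single tree with n posts and n - 1 reply links, this simplifies to 2 / n, so density shrinks as a thread grows.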

In [50]:
conv.degree_hist  # degree histogram: entry i is the number of posts with degree i (links to the parent plus replies received)

[0,
 42,
 8,
 3,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1]
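
Since most entries are zero, a sparse view is easier to read. The sketch below rebuilds the list shown above and also shows how such a histogram can be derived from reply edges. The degree-weighted total comes to 106, twice the 53 connections, which suggests (but does not confirm) that degree here counts both the link to a parent and replies received. The function name `degree_histogram` is hypothetical.

```python
from collections import Counter


def degree_histogram(edges):
    """hist[d] = number of posts touching exactly d reply links.

    `edges` is an iterable of (child_id, parent_id) pairs. Posts that
    appear in no edge are not counted.
    """
    degree = Counter()
    for child, parent in edges:
        degree[child] += 1
        degree[parent] += 1
    hist = [0] * (max(degree.values(), default=0) + 1)
    for d in degree.values():
        hist[d] += 1
    return hist


# Toy thread: 'b' and 'c' reply to 'a'; 'd' replies to 'b'.
toy_hist = degree_histogram([('b', 'a'), ('c', 'a'), ('d', 'b')])
print(toy_hist)  # [0, 2, 2]

# The histogram printed above, rebuilt compactly, then viewed sparsely.
notebook_hist = [0, 42, 8, 3] + [0] * 35 + [1]
sparse = {deg: n for deg, n in enumerate(notebook_hist) if n}
print(sparse)  # {1: 42, 2: 8, 3: 3, 39: 1}
print(sum(deg * n for deg, n in sparse.items()))  # 106
```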

In [51]:
conv.tree_depth  # height of the conversational tree

3

In [52]:
conv.tree_width  # the width of a depth level is the number of posts at that level (distance from source); tree width is the maximum width over all levels

39
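
Depth and width can both be read off a breadth-first traversal from the source posts. The following is a minimal sketch on a toy parent map. The function name `depth_and_width` is hypothetical, and depth is counted here in reply hops from the source, which may or may not match the library's convention.

```python
from collections import Counter, deque


def depth_and_width(parent_of):
    """Return (max depth, max posts on one level) for a reply forest.

    `parent_of` maps post ID -> parent ID, with None for source posts.
    Posts whose parent is missing from the map are not reached.
    """
    children = {}
    for uid, parent in parent_of.items():
        children.setdefault(uid, [])
        if parent is not None:
            children.setdefault(parent, []).append(uid)

    # Breadth-first walk from every source, recording each post's level.
    level = {}
    queue = deque((uid, 0) for uid, parent in parent_of.items() if parent is None)
    while queue:
        node, depth = queue.popleft()
        if node in level:
            continue
        level[node] = depth
        queue.extend((child, depth + 1) for child in children[node])

    per_level = Counter(level.values())
    return max(level.values(), default=0), max(per_level.values(), default=0)


# Toy thread: 'b' and 'c' reply to 'a'; 'd' replies to 'b'.
toy_depth, toy_width = depth_and_width({'a': None, 'b': 'a', 'c': 'a', 'd': 'b'})
print(toy_depth, toy_width)  # 2 2
```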

In [53]:
conv.start_time, conv.end_time, conv.duration  # time-related properties of the conversation

(datetime.datetime(2021, 1, 15, 22, 13, 56),
 datetime.datetime(2021, 6, 26, 2, 24, 27),
 13925431.0)
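
The duration above equals the difference between the two timestamps, expressed in seconds. A short check using only the standard library, with the values copied from the output:

```python
from datetime import datetime

# Values copied from the output above.
start = datetime(2021, 1, 15, 22, 13, 56)
end = datetime(2021, 6, 26, 2, 24, 27)

duration = (end - start).total_seconds()
print(duration)  # 13925431.0

# The same span in days, which is easier to read for long threads.
print(round(duration / 86400, 1))  # 161.2
```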

In [56]:
conv.time_series  # times of posting, in order

[1610766836.0,
 1610776960.0,
 1611091403.0,
 1611821533.0,
 1612576845.0,
 1612609103.0,
 1612700963.0,
 1612919594.0,
 1612920076.0,
 1613366588.0,
 1613710652.0,
 1613733214.0,
 1613822136.0,
 1613990538.0,
 1614127850.0,
 1614133364.0,
 1614291198.0,
 1614329240.0,
 1614766876.0,
 1614831504.0,
 1614990207.0,
 1614990487.0,
 1614990614.0,
 1615008458.0,
 1615014199.0,
 1615014340.0,
 1615074270.0,
 1615328796.0,
 1615413147.0,
 1615544908.0,
 1615610208.0,
 1615687393.0,
 1616625364.0,
 1616642294.0,
 1616979001.0,
 1616987414.0,
 1617059352.0,
 1617450778.0,
 1617753443.0,
 1618916480.0,
 1619494051.0,
 1619687623.0,
 1620259295.0,
 1621052070.0,
 1621172220.0,
 1621764025.0,
 1622192369.0,
 1622942841.0,
 1622952037.0,
 1623118670.0,
 1623657263.0,
 1624192319.0,
 1624338889.0,
 1624688667.0]
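
A common next step with an ordered time series is to look at the gaps between consecutive posts. A minimal sketch using the first five timestamps from the output above (the variable names are illustrative):

```python
from statistics import median

# First five timestamps copied from the output above (seconds).
times = [1610766836.0, 1610776960.0, 1611091403.0, 1611821533.0, 1612576845.0]

# Pair each timestamp with its successor and take the difference.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(gaps)  # [10124.0, 314443.0, 730130.0, 755312.0]

# Median gap in hours, a rough indicator of how quickly the thread moves.
print(round(median(gaps) / 3600, 1))  # 145.1
```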

In [57]:
conv.text_stream  # text of posts, in temporal order

["Sublet Thread - Spring Summer: Here's the thread for Spring Summer sublets. Please keep all posts related to sublets in this thread, as I'll remove any posts that aren't here.",
 'My roommate and I are looking to sublet our apartment for the rest of our lease (ends in August 2021 with option for you to renew). 2 bed. 1 bath. Located at 36th & Hamilton. Total rent is $1850 per month plus utilities. Very large ~1200 sqft apartment on the ground floor with a cute backyard. Pet friendly. \n\nDM me for questions or more info.',
 '[deleted]',
 'Hiya everyone! I am currently looking to relet my apartment spot at University Crossings. It is a 2 bedroom with 2 people per room, 2 bathrooms and is furnished. The specific room plan is called 2 bed-2 bath C Double standard twin. Currently, there are 3 other guys in the apartment. The lease would be from March end to Sep 2021. The lease can also be ended for reasons pertaining to co-op (is co-op friendly). The apartment is super close to many of t